apeskov commented on PR #11508:
URL: https://github.com/apache/tvm/pull/11508#issuecomment-1144075748
@crazydemo Answering your question about performance.
> I wonder if we can get better performance via running layernorm on dnnl
> codegen than running consecutive ops on native codegen. Could you please
> provide some performance numbers?
Yes, there is a performance benefit. At the very least, the two use different
memory-access patterns. Consecutive ops with the llvm codegen produce a
sequence of fused kernels like this:
* mean: one pass through the src tensor
* sub: one pass through the src and dst tensors
* power + mean: one pass through src
* add + sqrt + div + mul + add: one pass through src and dst
In total, the TVM codegen traverses the data tensor 6 times. DNNL implements
layer norm as a single kernel and does only 4 passes through memory buffers (or
3 in the case of in-place memory).
On multi-core systems (Xeon servers and others) the normalization op is memory
bound, so reducing memory accesses becomes even more important.
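To make the pass counting concrete, here is a minimal sketch of layer norm
written as the four separate kernels listed above (plain Python for
illustration; the function name and structure are mine, not TVM's or DNNL's
actual implementation). Each comment notes the memory traversals that kernel
contributes to the total of 6:

```python
import math

def layernorm_passes(src, gamma, beta, eps=1e-5):
    """Layer norm as 4 separate kernels, mirroring the fused-kernel
    sequence the TVM llvm codegen produces (illustrative sketch)."""
    n = len(src)
    # Kernel 1: mean -- one pass reading src           (1 traversal)
    mu = sum(src) / n
    # Kernel 2: sub -- one read of src, one write      (2 traversals)
    diff = [x - mu for x in src]
    # Kernel 3: power + mean -- one read of diff       (1 traversal)
    var = sum(d * d for d in diff) / n
    # Kernel 4: add + sqrt + div + mul + add --
    # one read of diff, one write of dst               (2 traversals)
    inv = 1.0 / math.sqrt(var + eps)
    return [g * d * inv + b for g, d, b in zip(gamma, diff, beta)]
```

A single fused kernel, as DNNL provides, can compute the statistics and the
normalized output with only one read of src for the statistics, one more read
for the normalization, and one write of dst (with dst replacing src when the
operation runs in place), which is where the 4-vs-6 pass difference comes from.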
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]