kpuatamazon edited a comment on pull request #19562:
URL: https://github.com/apache/incubator-mxnet/pull/19562#issuecomment-744344661


   I've been using a c5.12xlarge `Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz`.  I assume these numbers are in some sort of seconds?  
   
   We should at least try `-march=native` to see whether it's just a matter of CPU support; MXNet doesn't seem to enable AVX512 by default, and one could add CPUID dispatch.  
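   Before rebuilding with `-march=native`, it's worth confirming the host actually reports AVX512. A minimal Linux-only sketch (the function name is mine, not anything in MXNet):

   ```python
   import pathlib

   def cpu_has_avx512f():
       """Return True if /proc/cpuinfo lists the avx512f flag (Linux only)."""
       try:
           info = pathlib.Path("/proc/cpuinfo").read_text()
       except OSError:
           return False  # not Linux, or cpuinfo unreadable
       return "avx512f" in info

   print(cpu_has_avx512f())
   ```

   On a c5.12xlarge (Cascade Lake) this should report True; if it does and the kernels still aren't vectorized, the problem is the build flags rather than the hardware.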
   
   Might as well reshape to two dimensions, with the normalized axis preserved and everything else multiplied together.  The problem is identical for e.g. 100x28x10x10x10 and 280000x10.  Also, those are some really small channels to layer normalize over.  
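   A quick NumPy sketch of that equivalence, using the shapes from above (the plain mean/variance normalization here is an assumption; it omits any learned scale/shift):

   ```python
   import numpy as np

   def layer_norm_last_axis(x, eps=1e-5):
       """Normalize over the last axis; eps value is an assumed default."""
       mean = x.mean(axis=-1, keepdims=True)
       var = x.var(axis=-1, keepdims=True)
       return (x - mean) / np.sqrt(var + eps)

   rng = np.random.default_rng(0)
   x = rng.standard_normal((100, 28, 10, 10, 10)).astype(np.float32)

   # Collapse every axis except the normalized one: 100*28*10*10 = 280000 rows.
   flat = x.reshape(-1, x.shape[-1])        # shape (280000, 10)

   out_nd = layer_norm_last_axis(x)
   out_2d = layer_norm_last_axis(flat).reshape(x.shape)

   assert np.allclose(out_nd, out_2d)       # identical problem, identical result
   ```

   So a kernel only ever needs to handle the (rows, channels) case; the leading axes are just bookkeeping.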
   
   Also, I feel like the optimal assembly implementation would benefit from a different ordering of the input tensor to allow for pure vertical adds, whereas layer normalization is currently set up for horizontal adds.  I can certainly see how a JIT would do better at e.g. 1000x3, where multiple problems share the same vector.  But oddly, that's where marian is doing better.  

