cjolivier01 commented on issue #8937: mxnet cpu broadcasting >5x slower than 
numpy
URL: 
https://github.com/apache/incubator-mxnet/issues/8937#issuecomment-349507459
 
 
   Resolution:
   
   In 0.11, when mxnet was built with CUDA enabled, OMP was not used for most operations, even CPU-only ones.  This changed in 0.12.
   
   In 1.0, two additional things happened:
     1) @piiswrong refactored the broadcast operators from the deprecated mshadow framework to his Kernel/Launch framework
     2) Auto-tuning was added to select the better of the OMP and non-OMP modes based upon thread count, number of data items, and the operator's calculation "weight", so OMP is used for larger datasets and a single thread for smaller ones (which is often faster).
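
   The auto-tuning idea in point 2 can be sketched as a simple cost threshold. This is only an illustration of the heuristic described above, not mxnet's actual code; the function name, parameters, and threshold value are all assumptions:

   ```python
   # Hypothetical sketch of the OMP-vs-serial decision: go parallel only when
   # the estimated work (items x per-item "weight") is large enough that thread
   # startup/synchronization overhead is amortized. Threshold is illustrative.

   def use_omp(num_items, op_weight, num_threads, threshold=100000):
       """Return True if OMP-style parallel execution is likely to win."""
       if num_threads <= 1:
           return False          # no parallelism available
       # Small workloads: thread overhead dominates, so stay single-threaded.
       return num_items * op_weight >= threshold

   # A (1000, 10000) broadcast is large enough to parallelize; 100 items is not.
   print(use_omp(1000 * 10000, op_weight=1, num_threads=8))  # True
   print(use_omp(100, op_weight=1, num_threads=8))           # False
   ```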
   
   Post 1.0, IntelOMP is built and used by default for cmake builds.
   
   Together, these changes make mxnet faster than numpy for most broadcast ops (at least in what I have seen so far).  Times I get for the above script:
   
   PASS 1...
   (1000, 10000)
   avg time over 100 trials: 0.0125194501877s
   (1000, 10000)
   avg time over 100 trials: 0.0115594983101s
   PASS 2...
   (1000, 10000)
   avg time over 100 trials: 0.0119210195541s
   (1000, 10000)
   avg time over 100 trials: 0.0114824414253s
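
   For context, a minimal benchmark in the spirit of the script discussed above would look roughly like this. The original script is linked in the issue; the shape, the broadcast-add operation, and the trial count here are assumptions for illustration:

   ```python
   import timeit
   import numpy as np

   # Time a broadcast add of a (rows, cols) matrix with a (1, cols) row vector,
   # averaged over several trials, and report the result shape and mean time.

   def bench(shape=(1000, 10000), trials=10):
       a = np.random.rand(*shape)
       b = np.random.rand(1, shape[1])   # row vector, broadcast over all rows
       out = a + b                       # broadcast add
       avg = timeit.timeit(lambda: a + b, number=trials) / trials
       print(out.shape)
       print("avg time over %d trials: %ss" % (trials, avg))
       return out.shape, avg

   bench()
   ```

   The same loop body can be pointed at mxnet's ndarray to compare the two backends on identical shapes.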
   
   So, I think that the issue is resolved in current builds.  To get similar performance in 0.11, you would probably need to build without CUDA enabled.  
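
   For a Makefile-based 0.11 build, the CPU-only configuration would look roughly like this (USE_CUDA and USE_OPENMP are the standard mxnet Makefile flags; the job count is an assumption, adjust to your machine):

   ```shell
   # Build mxnet from a source checkout without CUDA, keeping OpenMP for CPU ops.
   make -j8 USE_CUDA=0 USE_OPENMP=1
   ```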
   
   Please reopen the issue if necessary.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
