cjolivier01 commented on issue #8937: mxnet cpu broadcasting >5x slower than numpy
URL: https://github.com/apache/incubator-mxnet/issues/8937#issuecomment-349507459

Resolution: In 0.11, when mxnet was built with CUDA enabled, OMP was not used for most operations, even operations that ran entirely on the CPU. This changed in 0.12. In 1.0, two additional things happened:

1) @piiswrong refactored the broadcast operators from the deprecated mshadow framework to his Kernel/Launch framework.
2) Auto-tuning was added that selects OMP or non-OMP execution based on thread count, the number of data items, and the operator's computational "weight", so OMP is used for larger workloads and a single thread for smaller ones (which is often faster for small data).

Post 1.0, Intel OMP is built and used by default for cmake builds. Together, these changes make mxnet faster than numpy for most broadcast ops, as far as I have seen. Times I get for the above script:

PASS 1...
(1000, 10000) avg time over 100 trials: 0.0125194501877s
(1000, 10000) avg time over 100 trials: 0.0115594983101s
PASS 2...
(1000, 10000) avg time over 100 trials: 0.0119210195541s
(1000, 10000) avg time over 100 trials: 0.0114824414253s

So I believe the issue is resolved in current builds. To get similar performance on 0.11, you would probably need to build without CUDA enabled. Please reopen the issue if necessary.
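The original benchmark script is not reproduced in this comment, so as a point of reference, here is a minimal numpy-only sketch of a broadcast benchmark in the same spirit: it broadcasts a row vector across a 2-D array of the shape reported above and prints the average time per trial. The function name and defaults are illustrative, not taken from the issue's script.

```python
import time

import numpy as np


def bench_broadcast(shape=(1000, 10000), trials=100):
    """Time a broadcast add of a (N,) vector across a 2-D array.

    Hypothetical helper; shapes mirror the (1000, 10000) case in the
    timings above, but this is not the script from the original issue.
    """
    a = np.ones(shape, dtype=np.float32)
    # 1-D vector of length shape[1]; numpy broadcasts it along axis 0.
    b = np.arange(shape[1], dtype=np.float32)

    start = time.time()
    for _ in range(trials):
        a + b  # broadcast add, result discarded
    avg = (time.time() - start) / trials

    print("%s avg time over %d trials: %fs" % (shape, trials, avg))
    return avg


if __name__ == "__main__":
    bench_broadcast()
```

Swapping the body of the loop for the equivalent `mx.nd.array` operation (plus a blocking call such as `.wait_to_read()`, since mxnet executes asynchronously) would give the mxnet side of the comparison.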
