safrooze opened a new issue #11905: Poor concat CPU performance on CUDA builds
URL: https://github.com/apache/incubator-mxnet/issues/11905

## Description

With a CUDA build (`mxnet-cu90`), the concat operation on CPU is 2.2x slower than with a non-CUDA build (`mxnet`). I believe the problem is that OMP is disabled in mshadow when CUDA is enabled (https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/tensor_cpu-inl.h#L149).

## Environment info (Required)

```
----------Python Info----------
Version      : 3.4.5
Compiler     : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
Build        : ('default', 'Jul 2 2016 17:47:47')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 18.0
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p34/lib/python3.4/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p34/lib/python3.4/site-packages/mxnet
Commit Hash  : f5b95b090815e879b57dca233604dcb3f1df967a
----------System Info----------
Platform     : Linux-4.9.93-41.60.amzn1.x86_64-x86_64-with-glibc2.2.5
system       : Linux
node         : ip-172-31-73-235
release      : 4.9.93-41.60.amzn1.x86_64
version      : #1 SMP Fri Apr 13 21:58:27 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2691.662
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.3027 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0025 sec, LOAD: 0.0957 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0040 sec, LOAD: 0.0293 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0301 sec, LOAD: 0.3765 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0752 sec, LOAD: 0.2172 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1274 sec, LOAD: 0.4291 sec.
```

I'm using the Python package.

## Minimum reproducible example

```
from time import time

import mxnet as mx
from mxnet import nd

ctx = mx.cpu()
num_iter = 1000
start = time()
for i in range(num_iter):
    a = nd.empty((1, 512, 120 * 120), ctx=ctx)
    b = nd.empty((1, 512, 1), ctx=ctx)
    c = nd.concat(a, b, dim=2)
nd.waitall()
print('elapsed: {:.2f}'.format(time() - start))
```

With the `mxnet` package, I get `elapsed: 19.42`. With the `mxnet-cu90` package, I get `elapsed: 45.97`.

## What have you tried to solve it?

Looking at the concat implementation: it uses mshadow's `MapPlan`, which only enables `omp parallel` when the `MSHADOW_USE_CUDA` compiler flag is disabled.
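As a rough sanity check that this workload is copy-bound (a back-of-envelope sketch, assuming `nd.empty`'s default float32 dtype), the bytes written per concat call can be estimated:

```python
# Back-of-envelope for the repro above, assuming mxnet's default
# float32 dtype: concat writes the whole output tensor, so each call
# moves the full output through memory once.
elem_size = 4                        # bytes per float32 element
a_elems = 1 * 512 * (120 * 120)      # elements in a
b_elems = 1 * 512 * 1                # elements in b
out_bytes = (a_elems + b_elems) * elem_size
total_gb = out_bytes * 1000 / 1e9    # 1000 iterations in the loop
print('{:.1f} MB per concat, {:.1f} GB total'.format(out_bytes / 1e6, total_gb))
# → 29.5 MB per concat, 29.5 GB total
```

Streaming ~30 GB through a single thread instead of multiple OMP threads would be consistent with the roughly 2x gap observed above. One quick way to test the hypothesis directly is to run the non-CUDA `mxnet` build with `OMP_NUM_THREADS=1`; if OMP is the cause, its timing should degrade toward the `mxnet-cu90` number.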
