safrooze opened a new issue #11905: Poor concat CPU performance on CUDA builds
URL: https://github.com/apache/incubator-mxnet/issues/11905
 
 
   ## Description
   If I use a CUDA build (`mxnet-cu90`), the concat operation is 2.2x slower than with a non-CUDA build (`mxnet`). I believe the problem is that OpenMP is disabled in mshadow when CUDA is enabled (https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/tensor_cpu-inl.h#L149).
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.4.5
   Compiler     : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
   Build        : ('default', 'Jul  2 2016 17:47:47')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 18.0
   Directory    : /home/ec2-user/anaconda3/envs/mxnet_p34/lib/python3.4/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.3.0
   Directory    : /home/ec2-user/anaconda3/envs/mxnet_p34/lib/python3.4/site-packages/mxnet
   Commit Hash  : f5b95b090815e879b57dca233604dcb3f1df967a
   ----------System Info----------
   Platform     : Linux-4.9.93-41.60.amzn1.x86_64-x86_64-with-glibc2.2.5
   system       : Linux
   node         : ip-172-31-73-235
   release      : 4.9.93-41.60.amzn1.x86_64
   version      : #1 SMP Fri Apr 13 21:58:27 UTC 2018
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                8
   On-line CPU(s) list:   0-7
   Thread(s) per core:    2
   Core(s) per socket:    4
   Socket(s):             1
   NUMA node(s):          1
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 79
   Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
   Stepping:              1
   CPU MHz:               2691.662
   BogoMIPS:              4600.11
   Hypervisor vendor:     Xen
   Virtualization type:   full
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              46080K
   NUMA node0 CPU(s):     0-7
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.3027 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0025 sec, LOAD: 0.0957 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0040 sec, LOAD: 0.0293 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0301 sec, LOAD: 0.3765 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0752 sec, LOAD: 0.2172 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1274 sec, LOAD: 0.4291 sec.
   ```
   I'm using Python package.
   
   ## Minimum reproducible example
   
   ```python
   import mxnet as mx
   from mxnet import nd
   from time import time

   ctx = mx.cpu()
   num_iter = 1000
   start = time()
   for i in range(num_iter):
       a = nd.empty((1, 512, 120 * 120), ctx=ctx)
       b = nd.empty((1, 512, 1), ctx=ctx)
       c = nd.concat(a, b, dim=2)
   nd.waitall()
   print('elapsed: {:.2f}'.format(time() - start))
   ```
   
   With the `mxnet` package I get `elapsed: 19.42`; with the `mxnet-cu90` package I get `elapsed: 45.97`.
   
   ## What have you tried to solve it?
   
   Looking at the concat implementation, it uses mshadow's `MapPlan`, which only enables `omp parallel` when the `MSHADOW_USE_CUDA` compiler flag is disabled.
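   For reference, the pattern described above boils down to a compile-time guard of roughly this shape. This is an illustrative sketch, not the actual mshadow source: `MSHADOW_USE_CUDA` is the real flag, but the surrounding loop and function names are simplified stand-ins.

   ```cpp
   #include <cstdio>
   #include <vector>

   // Illustrative: CUDA builds define this to 1, which drops the OpenMP
   // pragma below even though the loop still executes on the CPU.
   #define MSHADOW_USE_CUDA 1

   // Simplified stand-in for an mshadow CPU kernel loop.
   float first_after_copy() {
     std::vector<float> src(1 << 20, 1.0f), dst(1 << 20, 0.0f);
   #if (MSHADOW_USE_CUDA == 0)
     #pragma omp parallel for
   #endif
     for (int i = 0; i < static_cast<int>(src.size()); ++i) {
       dst[i] = src[i];  // runs single-threaded when MSHADOW_USE_CUDA == 1
     }
     return dst[0];
   }

   int main() {
     std::printf("first element after copy: %f\n", first_after_copy());
     return 0;
   }
   ```

   The point is that the guard keys off whether the build has CUDA at all, not off which device the operator actually runs on, so CPU ops in a CUDA build lose OpenMP parallelism.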
