stu1130 edited a comment on issue #14725: Performance Regression on CUDA10
URL: https://github.com/apache/incubator-mxnet/issues/14725#issuecomment-486016229

I reran the minimal reproducible script shown above, set num = 100000, and got the following result under nvprof -s:

```
# CUDA 10.0 mxnet-cu100mkl
GPU activities:
 35.43%  7.72535s   99995  77.257us  71.167us  83.295us  volta_sgemm_32x32_sliced1x4_nn
 29.99%  6.53866s   99995  65.389us  62.623us  72.287us  volta_sgemm_128x64_nt
 16.11%  3.51241s  199990  17.562us  7.2000us  49.471us  [CUDA memcpy DtoH]
 13.52%  2.94757s   99995  29.477us  27.872us  34.559us  volta_sgemm_64x32_sliced1x4_tn
 ...
Average: 0.001091881209394027
Total: 109.18266153335571
------------------------------------------------------------------------------
# CUDA 9.2 mxnet-cu92mkl
GPU activities:
 44.88%  7.94254s   99995  79.429us  75.583us  84.703us  volta_sgemm_32x32_sliced1x4_nn
 19.34%  3.42300s  199990  17.115us  7.2950us  58.656us  [CUDA memcpy DtoH]
 17.95%  3.17554s   99995  31.757us  29.952us  38.655us  volta_sgemm_32x32_sliced1x4_tn
 12.94%  2.28917s   99995  22.892us  20.927us  29.280us  volta_sgemm_128x64_nt
 ...
Average: 0.0009327297395715428
Total: 93.26831030845642
```

We can see that **volta_sgemm_128x64_nt** on CUDA 10.0 took almost 3 times as long as on CUDA 9.2 (6.54s vs. 2.29s in total). The reason the overall totals are still comparable is that volta_sgemm_32x32_sliced1x4_nn dominates the execution time in this benchmark, which is not the case in the LSTM workload.
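The original reproducible script is not reproduced in this comment, but the timing loop it describes (run a dense matrix product `num` times, then report the average and total wall time) can be sketched as follows. This is a hypothetical illustration: `bench_dot`, the shapes, and the use of NumPy's `dot` as a CPU stand-in for `mx.nd.dot` are my assumptions, not the author's actual script, which runs on the GPU under nvprof.

```python
# Hypothetical sketch of the benchmark loop described above: time `num`
# iterations of an (m, k) x (k, n) matrix product and report the average
# and total wall time. NumPy's dot is a CPU stand-in for mx.nd.dot here;
# the real script forces GPU results (e.g. via asnumpy()) before timing.
import time

import numpy as np


def bench_dot(m, k, n, num=100):
    """Return (average, total) wall time over `num` matrix products."""
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    start = time.time()
    for _ in range(num):
        c = a.dot(b)
        _ = float(c[0, 0])  # force the result, as asnumpy() would on GPU
    total = time.time() - start
    return total / num, total


avg, total = bench_dot(640, 650, 10000, num=10)
print(f"Average: {avg}")
print(f"Total: {total}")
```

Running a script like this under `nvprof -s python bench.py` produces per-kernel summaries of the kind quoted above.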
Here are the results for the other data-shape combinations.

* Note that all of them run only 100 times (num = 100).

```
data shape ('640,10000', '640,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:
 54.06%  256.89ms  190  1.3521ms  133.44us  11.855ms  [CUDA memcpy DtoH]
 15.03%  71.430ms   95  751.90us  707.26us  837.63us  volta_sgemm_128x64_nn
 14.59%  69.329ms   95  729.77us  690.62us  809.12us  volta_sgemm_128x64_nt
 12.93%  61.456ms   95  646.91us  568.61us  730.72us  volta_sgemm_128x64_tn
  1.33%  6.3357ms   95  66.691us  65.983us  67.487us
# CUDA 9.2 mxnet-cu92mkl
GPU activities:
 56.68%  295.08ms  190  1.5531ms  133.41us  13.890ms  [CUDA memcpy DtoH]
 14.10%  73.427ms   95  772.92us  705.98us  839.36us  volta_sgemm_128x64_nn
 13.10%  68.185ms   95  717.74us  655.04us  775.13us  volta_sgemm_128x64_nt
 13.00%  67.684ms   95  712.47us  640.54us  778.40us  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
data shape ('960,10000', '960,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:
 79.61%  1.38828s  190  7.3068ms  289.41us  18.374ms  [CUDA memcpy DtoH]
  6.66%  116.15ms   95  1.2226ms  1.2185ms  1.2268ms  volta_sgemm_128x64_nn
  6.42%  111.92ms   95  1.1781ms  1.1731ms  1.1876ms  volta_sgemm_128x64_nt
  5.98%  104.26ms   95  1.0975ms  1.0937ms  1.1239ms  volta_sgemm_32x128_tn
  0.54%  9.4320ms   95  99.284us  98.719us  99.839us
# CUDA 9.2 mxnet-cu92mkl
GPU activities:
 80.65%  1.45573s  190  7.6617ms  270.69us  18.452ms  [CUDA memcpy DtoH]
  6.43%  116.12ms   95  1.2223ms  1.2162ms  1.2279ms  volta_sgemm_128x64_nn
  6.03%  108.86ms   95  1.1459ms  1.1444ms  1.1490ms  volta_sgemm_128x64_nt
  5.61%  101.24ms   95  1.0657ms  1.0618ms  1.0718ms  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
data shape ('1600,10000', '1600,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:
 81.71%  2.63850s  190  13.887ms  534.24us  32.062ms  [CUDA memcpy DtoH]
  6.48%  209.16ms   95  2.2017ms  2.1817ms  2.3835ms  volta_sgemm_128x64_nn
  5.67%  183.19ms   95  1.9283ms  1.9143ms  1.9402ms  volta_sgemm_128x64_nt
  5.01%  161.71ms   95  1.7023ms  1.6479ms  1.7094ms  volta_sgemm_128x64_tn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:
 81.32%  2.57187s  190  13.536ms  505.56us  32.412ms  [CUDA memcpy DtoH]
  6.63%  209.82ms   95  2.2086ms  2.1808ms  2.3828ms  volta_sgemm_128x64_nn
  5.71%  180.48ms   95  1.8998ms  1.8987ms  1.9037ms  volta_sgemm_128x64_nt
  5.19%  164.03ms   95  1.7266ms  1.7225ms  1.7318ms  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
data shape ('1280,10000', '1280,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:
 82.14%  2.12338s  190  11.176ms  451.01us  26.345ms  [CUDA memcpy DtoH]
  5.94%  153.49ms   95  1.6157ms  1.6120ms  1.6218ms  volta_sgemm_128x64_nn
  5.71%  147.61ms   95  1.5538ms  1.5445ms  1.5637ms  volta_sgemm_128x64_nt
  5.06%  130.83ms   95  1.3772ms  1.3723ms  1.3809ms  volta_sgemm_32x128_tn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:
 82.17%  2.08856s  190  10.992ms  467.93us  26.238ms  [CUDA memcpy DtoH]
  5.99%  152.15ms   95  1.6016ms  1.5995ms  1.6043ms  volta_sgemm_128x64_nn
  5.70%  144.78ms   95  1.5240ms  1.5219ms  1.5288ms  volta_sgemm_128x64_nt
  4.98%  126.54ms   95  1.3320ms  1.3283ms  1.3399ms  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
data shape ('320,10000', '320,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:
 58.78%  186.02ms  190  979.07us  65.055us  6.4167ms  [CUDA memcpy DtoH]
 14.23%  45.045ms   95  474.16us  457.02us  492.99us  volta_sgemm_128x64_nt
 12.23%  38.704ms   95  407.41us  348.99us  540.25us  volta_sgemm_128x128_tn
 11.78%  37.298ms   95  392.61us  359.84us  424.28us  volta_sgemm_128x64_nn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:
 62.66%  207.77ms  190  1.0935ms  65.087us  6.6399ms  [CUDA memcpy DtoH]
 11.83%  39.221ms   95  412.86us  349.63us  540.79us  volta_sgemm_128x128_tn
 11.48%  38.053ms   95  400.56us  360.64us  424.86us  volta_sgemm_128x64_nn
 11.14%  36.947ms   95  388.92us  354.17us  423.29us  volta_sgemm_128x64_nt
------------------------------------------------------------------------------
data shape ('1920,10000', '1920,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:
 82.80%  3.29555s  190  17.345ms  664.22us  40.348ms  [CUDA memcpy DtoH]
  5.70%  227.01ms   95  2.3896ms  2.3852ms  2.3959ms  volta_sgemm_128x64_nn
  5.52%  219.56ms   95  2.3111ms  2.2841ms  2.3172ms  volta_sgemm_128x64_nt
  4.81%  191.49ms   95  2.0157ms  1.9679ms  2.0367ms  volta_sgemm_128x64_tn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:
 80.50%  2.80315s  190  14.753ms  664.44us  36.470ms  [CUDA memcpy DtoH]
  6.52%  227.09ms   95  2.3904ms  2.3850ms  2.3956ms  volta_sgemm_128x64_nn
  6.21%  216.39ms   95  2.2778ms  2.2743ms  2.2854ms  volta_sgemm_128x64_nt
  5.44%  189.46ms   95  1.9943ms  1.9876ms  2.0082ms  volta_sgemm_128x128_tn
```
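Comparing the same kernel across the two builds by eye is error-prone, so a small parser over the nvprof summary lines can help. This is a sketch under the assumption that each row is whitespace-separated as shown above (percent, total time, calls, avg, min, max, name); `parse_time` and `kernel_totals` are names I made up for illustration.

```python
# Sketch: parse nvprof summary rows (as quoted above) into {kernel: total
# seconds}, so the same kernel can be compared across CUDA builds. Assumed
# column layout: percent, total time, calls, avg, min, max, kernel name.
import re


def parse_time(s):
    """Convert an nvprof time like '6.53866s', '1.5531ms', '65.389us' to seconds."""
    m = re.match(r"([0-9.]+)(s|ms|us|ns)$", s)
    value, unit = float(m.group(1)), m.group(2)
    return value * {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}[unit]


def kernel_totals(summary):
    """Map each kernel name to its total time in seconds."""
    totals = {}
    for line in summary.strip().splitlines():
        parts = line.split()
        if len(parts) < 7 or not parts[0].endswith("%"):
            continue  # skip headers and truncated rows
        name = " ".join(parts[6:])  # keeps names like "[CUDA memcpy DtoH]" intact
        totals[name] = parse_time(parts[1])
    return totals


# The volta_sgemm_128x64_nt rows from the num = 100000 run above:
cu100 = kernel_totals("29.99%  6.53866s  99995  65.389us  62.623us  72.287us  volta_sgemm_128x64_nt")
cu92 = kernel_totals("12.94%  2.28917s  99995  22.892us  20.927us  29.280us  volta_sgemm_128x64_nt")
ratio = cu100["volta_sgemm_128x64_nt"] / cu92["volta_sgemm_128x64_nt"]
print(f"CUDA 10 / CUDA 9.2 for volta_sgemm_128x64_nt: {ratio:.2f}x")  # ~2.86x
```

Applied to the first profile, this reproduces the roughly 3x gap on volta_sgemm_128x64_nt between the two builds.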
