stu1130 commented on issue #14725: Performance Regression on CUDA10
URL: https://github.com/apache/incubator-mxnet/issues/14725#issuecomment-486016229
 
 
   I reran the minimal reproducible script shown above with `num = 100000` and 
collected the results with `nvprof -s`.
   **cu100mkl**
   ```
   GPU activities:   35.43%  7.72535s     99995  77.257us  71.167us  83.295us  
volta_sgemm_32x32_sliced1x4_nn
                      29.99%  6.53866s     99995  65.389us  62.623us  72.287us  
volta_sgemm_128x64_nt
                      16.11%  3.51241s    199990  17.562us  7.2000us  49.471us  
[CUDA memcpy DtoH]
                      13.52%  2.94757s     99995  29.477us  27.872us  34.559us  
volta_sgemm_64x32_sliced1x4_tn
   ...
   Average: 0.001091881209394027
   Total: 109.18266153335571
   ```
   **cu92mkl**
   ```
   GPU activities:   44.88%  7.94254s     99995  79.429us  75.583us  84.703us  
volta_sgemm_32x32_sliced1x4_nn
                      19.34%  3.42300s    199990  17.115us  7.2950us  58.656us  
[CUDA memcpy DtoH]
                      17.95%  3.17554s     99995  31.757us  29.952us  38.655us  
volta_sgemm_32x32_sliced1x4_tn
                      12.94%  2.28917s     99995  22.892us  20.927us  29.280us  
volta_sgemm_128x64_nt
   ...
   Average: 0.0009327297395715428
   Total: 93.26831030845642
   ```
   We can see that **volta_sgemm_128x64_nt** took almost 3 times as long on 
CUDA 10 as on CUDA 9.2 (65.4us vs 22.9us on average). The reason the total 
times are still similar is that volta_sgemm_32x32_sliced1x4_nn dominates the 
execution time in this script, which is not the case in the LSTM workload.
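   For context, the `Average`/`Total` lines come from wall-clock timing of the 
matmul loop. A minimal stand-in sketch of such a timing harness (hypothetical, 
not the original MXNet GPU script; it uses numpy on CPU purely to illustrate 
how the two numbers relate):
   ```python
   import time
   import numpy as np

   def benchmark_matmul(num=1000, n=512):
       """Time `num` back-to-back matrix multiplies and report the
       per-iteration average and the total wall time (stand-in for
       mx.nd.dot on GPU in the script referenced above)."""
       a = np.random.rand(n, n).astype(np.float32)
       b = np.random.rand(n, n).astype(np.float32)
       start = time.perf_counter()
       for _ in range(num):
           a @ b  # the GEMM being profiled
       total = time.perf_counter() - start
       print("Average:", total / num)
       print("Total:", total)
       return total / num, total

   benchmark_matmul(num=100, n=64)
   ```
   Running the real script under `nvprof` then attributes that total to the 
individual sgemm kernels, as in the tables above.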
