stu1130 commented on issue #14725: Performance Regression on CUDA10
URL: https://github.com/apache/incubator-mxnet/issues/14725#issuecomment-486016229

Reran the minimal reproducible script shown above with num = 100000 and got the following results under nvprof.

**cu100mkl**
```
GPU activities:
35.43%  7.72535s   99995  77.257us  71.167us  83.295us  volta_sgemm_32x32_sliced1x4_nn
29.99%  6.53866s   99995  65.389us  62.623us  72.287us  volta_sgemm_128x64_nt
16.11%  3.51241s  199990  17.562us  7.2000us  49.471us  [CUDA memcpy DtoH]
13.52%  2.94757s   99995  29.477us  27.872us  34.559us  volta_sgemm_64x32_sliced1x4_tn
...
Average: 0.001091881209394027
Total: 109.18266153335571
```

**cu92mkl**
```
GPU activities:
44.88%  7.94254s   99995  79.429us  75.583us  84.703us  volta_sgemm_32x32_sliced1x4_nn
19.34%  3.42300s  199990  17.115us  7.2950us  58.656us  [CUDA memcpy DtoH]
17.95%  3.17554s   99995  31.757us  29.952us  38.655us  volta_sgemm_32x32_sliced1x4_tn
12.94%  2.28917s   99995  22.892us  20.927us  29.280us  volta_sgemm_128x64_nt
...
Average: 0.0009327297395715428
Total: 93.26831030845642
```

We can see that **volta_sgemm_128x64_nt** took almost 3 times as long on CUDA 10 (6.54s) as on CUDA 9.2 (2.29s). The reason the totals are still similar is that **volta_sgemm_32x32_sliced1x4_nn** dominates the execution time here, which is not the case in LSTM.
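The actual reproduction script lives earlier in the issue thread and is not quoted here. As a rough illustration of the measurement shape behind the "Average"/"Total" lines above, a minimal GEMM timing loop might look like the sketch below (NumPy stands in for the MXNet/CUDA ops so it runs anywhere; on a GPU each matmul would dispatch an sgemm kernel like those in the nvprof tables):

```python
# Hypothetical sketch only -- not the script from the issue thread.
# Times `num` repeated float32 matrix multiplies and reports the same
# Average/Total figures the profiled runs print.
import time
import numpy as np

num = 1000  # the comment above uses num = 100000
a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

start = time.time()
for _ in range(num):
    c = a @ b  # on GPU this would be an sgemm kernel launch
total = time.time() - start

print("Average:", total / num)
print("Total:", total)
```

Run under `nvprof python script.py` (with the GPU variant of the ops), the per-kernel breakdown would attribute the wall-clock difference between CUDA toolkit versions to individual sgemm kernels, as in the tables above.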
