sxjscience commented on issue #17665: No speedup from using FP16 (4 times slower than PyTorch)
URL: https://github.com/apache/incubator-mxnet/issues/17665#issuecomment-592204581

I tried with `nvprof` and found that MXNet and PyTorch use different kernels. For MXNet, it's `volta_fp16_sgemm_fp16_64x64_nn`:

```
ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16.py
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
==117922== NVPROF is profiling process 117922, command: python3 test_fp16.py
57.4354133605957
==117922== Profiling application: python3 test_fp16.py
==117922== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  57.3739s       100  573.74ms  572.92ms  594.30ms  volta_fp16_sgemm_fp16_64x64_nn
                    0.00%  1.7993ms        3  599.78us  599.42us  600.26us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPN7mshadow4half6half_tEEEEviDpT0_
                    0.00%  190.78us      100  1.9070us  1.7600us  6.8160us  [CUDA memcpy DtoH]
                    0.00%  19.200us       12  1.6000us  1.5360us  1.9520us  [CUDA memcpy HtoD]
                    0.00%  11.424us        8  1.4280us  1.4080us  1.4720us  [CUDA memset]
      API calls:   76.78%  57.3844s      203  282.68ms  6.2690us  594.29ms  cudaStreamSynchronize
```

For PyTorch, it's `volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn`:
```
ubuntu@ip-172-31-27-255:~$ vi test_fp16_pytorch.py
ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16_pytorch.py
==118113== NVPROF is profiling process 118113, command: python3 test_fp16_pytorch.py
==118113== Profiling application: python3 test_fp16_pytorch.py
8.097127437591553
==118113== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.29%  8.08549s      100  80.855ms  80.561ms  93.579ms  volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn
                    2.71%  224.92ms       4  56.231ms  1.9200us  75.214ms  [CUDA memcpy HtoD]
                    0.00%  186.40us      100  1.8640us  1.6640us  3.9680us  [CUDA memcpy DtoH]
      API calls:   50.26%  8.30841s      103  80.664ms  74.913ms  93.269ms  cudaMemcpyAsync
                   49.40%  8.16635s        6  1.36106s  9.3230us  8.16199s  cudaMalloc
                    0.11%  18.989ms     1528  12.427us     714ns  479.17us  cuDeviceGetAttribute
                    0.11%  17.890ms       16  1.1181ms  1.0814ms  1.1642ms  cudaGetDeviceProperties
```
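The kernel names point at the cause of the gap: `s884gemm` is a Volta tensor-core (HMMA 8x8x4) kernel, while `sgemm_fp16` is a plain SIMT FP16 GEMM, so the MXNet run is not using tensor cores at all. One documented cuBLAS precondition for FP16 tensor-core GEMM on Volta is that the GEMM dimensions be multiples of 8; as a minimal, illustrative sketch (this helper is my own, not MXNet's or PyTorch's dispatch logic), a shape check like the following can rule out one common reason for falling back to the slow kernel:

```python
def tensor_core_eligible_fp16(m: int, n: int, k: int) -> bool:
    """Heuristic shape check for Volta FP16 tensor-core GEMM.

    Per the cuBLAS documentation for Volta-era libraries, FP16 GEMMs
    can dispatch to tensor-core (s884) kernels only when m, n, and k
    are all multiples of 8. Illustrative helper, not library code.
    """
    return m % 8 == 0 and n % 8 == 0 and k % 8 == 0


if __name__ == "__main__":
    # A typical square benchmark shape qualifies for tensor cores...
    print(tensor_core_eligible_fp16(4096, 4096, 4096))  # True
    # ...while one odd dimension forces the fallback SIMT kernel.
    print(tensor_core_eligible_fp16(4096, 4095, 4096))  # False
```

If the shapes are already multiples of 8, it may also be worth confirming that the `MXNET_CUDA_ALLOW_TENSOR_CORE` environment variable has not been set to 0, since that disables tensor-core math in MXNet's cuBLAS calls.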
