sxjscience commented on issue #17665: No speedup from using FP16 (4 times 
slower than PyTorch)
URL: 
https://github.com/apache/incubator-mxnet/issues/17665#issuecomment-592204581
 
 
   I tried with `nvprof` and found that MXNet and PyTorch use different kernels:
   
   For MXNet, it's `volta_fp16_sgemm_fp16_64x64_nn`.
   
   ```
   ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16.py
   /usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
     from ._conv import register_converters as _register_converters
   ==117922== NVPROF is profiling process 117922, command: python3 test_fp16.py
   57.4354133605957
   ==117922== Profiling application: python3 test_fp16.py
   ==117922== Profiling result:
               Type  Time(%)      Time     Calls       Avg       Min       Max  Name
    GPU activities:  100.00%  57.3739s       100  573.74ms  572.92ms  594.30ms  volta_fp16_sgemm_fp16_64x64_nn
                       0.00%  1.7993ms         3  599.78us  599.42us  600.26us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPN7mshadow4half6half_tEEEEviDpT0_
                       0.00%  190.78us       100  1.9070us  1.7600us  6.8160us  [CUDA memcpy DtoH]
                       0.00%  19.200us        12  1.6000us  1.5360us  1.9520us  [CUDA memcpy HtoD]
                       0.00%  11.424us         8  1.4280us  1.4080us  1.4720us  [CUDA memset]
         API calls:   76.78%  57.3844s       203  282.68ms  6.2690us  594.29ms  cudaStreamSynchronize
   ```
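   The `test_fp16.py` script itself is not shown in the comment, so this is only a minimal sketch of the kind of timing harness that would produce the 100 back-to-back GEMM launches visible in the profile; the matrix shapes, iteration count, and MXNet driver lines are assumptions, not the author's actual script:
   
   ```python
   import time
   
   def bench_gemm(matmul, sync, iters=100):
       """Time `iters` back-to-back GEMM launches, calling `sync` to flush
       asynchronous GPU work before reading the clock on either side."""
       sync()
       start = time.time()
       for _ in range(iters):
           matmul()
       sync()
       return time.time() - start
   
   # Hypothetical MXNet driver (shapes are guesses; the profile only tells us
   # there were 100 launches of the FP16 GEMM kernel):
   #   import mxnet as mx
   #   a = mx.nd.ones((8192, 8192), dtype='float16', ctx=mx.gpu())
   #   b = mx.nd.ones((8192, 8192), dtype='float16', ctx=mx.gpu())
   #   print(bench_gemm(lambda: mx.nd.dot(a, b), mx.nd.waitall))
   ```
   
   The explicit `sync` on both sides matters: without it, asynchronous launch queues make the wall-clock number meaningless, which is why the profiles above are the more trustworthy comparison.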
   
   For PyTorch, it's `volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn`.
   ```
   ubuntu@ip-172-31-27-255:~$ vi test_fp16_pytorch.py
   ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16_pytorch.py
   ==118113== NVPROF is profiling process 118113, command: python3 test_fp16_pytorch.py
   8.097127437591553
   ==118113== Profiling application: python3 test_fp16_pytorch.py
   ==118113== Profiling result:
               Type  Time(%)      Time     Calls       Avg       Min       Max  Name
    GPU activities:   97.29%  8.08549s       100  80.855ms  80.561ms  93.579ms  volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn
                        2.71%  224.92ms         4  56.231ms  1.9200us  75.214ms  [CUDA memcpy HtoD]
                        0.00%  186.40us       100  1.8640us  1.6640us  3.9680us  [CUDA memcpy DtoH]
         API calls:   50.26%  8.30841s       103  80.664ms  74.913ms  93.269ms  cudaMemcpyAsync
                       49.40%  8.16635s         6  1.36106s  9.3230us  8.16199s  cudaMalloc
                        0.11%  18.989ms      1528  12.427us     714ns  479.17us  cuDeviceGetAttribute
                        0.11%  17.890ms        16  1.1181ms  1.0814ms  1.1642ms  cudaGetDeviceProperties
   ```
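   The `s884` in the PyTorch kernel name marks it as a Volta Tensor Core GEMM (built on the 8x8x4 HMMA instruction), whereas MXNet's `fp16_sgemm` variant runs on the ordinary FP16 CUDA cores. Comparing the per-call averages reported in the two profiles above quantifies the gap:
   
   ```python
   # Per-GEMM "Avg" values taken from the two nvprof tables above (ms).
   mxnet_avg_ms = 573.74    # volta_fp16_sgemm_fp16_64x64_nn
   pytorch_avg_ms = 80.855  # volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn
   
   speedup = mxnet_avg_ms / pytorch_avg_ms
   print(f"Tensor Core kernel is ~{speedup:.1f}x faster per call")  # ~7.1x
   ```
   
   So the slowdown reported in this issue is consistent with MXNet's GEMM dispatch not selecting a Tensor Core kernel for this workload.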
   
