DickJC123 edited a comment on issue #14006: Dual stream cudnn Convolution 
backward() with MXNET_GPU_WORKER_NSTREAMS=2.
URL: https://github.com/apache/incubator-mxnet/pull/14006#issuecomment-464276100
 
 
   I've rerun some perf analysis of this PR, which, I'll remind everyone, changes 
nothing in the default configuration.  However, when I set 
MXNET_GPU_WORKER_NSTREAMS=2, I see higher performance at all batchsizes.  The 
perf gains I measured running Resnet50 v1b across 8 Volta GPUs (also with 
Horovod and DALI in NVIDIA's MXNet container) were:
   ```
   batchsize  32: 6.0% speedup
   batchsize  64: 0.8% speedup
   batchsize 128: 1.6% speedup
   batchsize 256: 0.4% speedup
   ```
   The primary application area of this PR is scale-out training across 
multiple nodes, where a too-large global batchsize can hurt final accuracy 
(thus driving the per-GPU batchsize down).  The RN50 global memory increase 
ranged from 1.4% (bs 32) to 2.6% (bs 256).
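   For anyone wanting to try this out, a minimal usage sketch follows. The 
environment variable name comes from this PR; the training invocation shown in 
the comment is illustrative only (the script name and flags are assumptions, 
not part of this PR):
   ```shell
   # Enable a second per-worker GPU stream; the default (1) is unchanged by this PR.
   export MXNET_GPU_WORKER_NSTREAMS=2

   # Then launch training as usual, e.g. (hypothetical script/flags):
   #   python train_imagenet.py --network resnet50_v1b --batch-size 32
   echo "$MXNET_GPU_WORKER_NSTREAMS"
   ```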
   
   This work is no longer "in progress."  Requesting final review, thanks. @szha 
@marcoabreu 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
