DickJC123 edited a comment on issue #14006: Dual stream cudnn Convolution backward() with MXNET_GPU_WORKER_NSTREAMS=2. URL: https://github.com/apache/incubator-mxnet/pull/14006#issuecomment-464276100

I've rerun some perf analysis of this PR, which, I'll remind everyone, changes nothing in the default configuration. However, when I set MXNET_GPU_WORKER_NSTREAMS=2, I see higher performance at all batch sizes. The gains I measured running Resnet50 v1b across 8 Volta GPUs (also with Horovod and DALI, in NVIDIA's MXNet container) were:

```
batchsize 32:  6.0% speedup
batchsize 64:  0.8% speedup
batchsize 128: 1.6% speedup
batchsize 256: 0.4% speedup
```

The primary application area of this PR is scale-out training across multiple nodes, where a too-large global batch size can hurt final accuracy (thus driving the per-GPU batch size down). The RN50 global memory increase ranged from 1.4% (batchsize 32) to 2.6% (batchsize 256).

This work is no longer "in progress." Requesting final review, thanks. @szha @marcoabreu
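For readers wanting to try the feature, a minimal sketch of enabling it is below. The only name taken from this PR is the MXNET_GPU_WORKER_NSTREAMS environment variable; the training command shown in the comment is hypothetical and stands in for whatever script you normally launch.

```shell
# Sketch: enable the second per-GPU worker stream this PR introduces.
# MXNET_GPU_WORKER_NSTREAMS is the variable from the PR title; leaving it
# unset (or set to 1) keeps the default single-stream behavior.
export MXNET_GPU_WORKER_NSTREAMS=2

# Then launch training as usual, e.g. (hypothetical script and flags):
#   python train_imagenet.py --network resnet50_v1b --batch-size 32

echo "MXNET_GPU_WORKER_NSTREAMS=$MXNET_GPU_WORKER_NSTREAMS"
```

Because it is an environment variable, no code change is needed; the setting is read by the MXNet process at startup.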
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
