DickJC123 opened a new pull request #12804: CudnnFind() usage improvements
URL: https://github.com/apache/incubator-mxnet/pull/12804

## Description ##

This PR improves MXNet's use of cudnnFind() to address a few issues:

1. With the gluon imperative style, cudnnFind() is called during forward(), and so might have its timings perturbed by other GPU activity (including potentially other cudnnFind() calls).
2. With some cuda driver versions, care is needed to ensure that the large I/O and workspace cudaMallocs() performed by cudnnFind() are immediately released and available to MXNet.
3. cudnnFind() makes both conv I/O and workspace allocations that must be covered by the GPU global memory headroom defined by MXNET_GPU_MEM_POOL_RESERVE. Per issue #12662, large convolutions can result in out-of-memory errors, even when MXNet's storage allocator has free memory in its pool.

This PR addresses these issues, providing the following benefits:

1. Consistent algo choice for a given convolution type in a model, both for instances on the same GPU and on other GPUs in a multi-GPU training setting.
2. Consistent algo choice from run to run, based on eliminating sources of interference with the cudnnFind() timing process.
3. Consistent model global memory footprint, both because of the consistent algo choice (algos can have markedly different workspace requirements) and because of changes to MXNet's use of cudaMalloc.
4. Increased training performance, based on being able to consistently run models that approach the GPU's full global memory footprint.
5. Adds a unittest for, and solves, issue #12662.

Specifically, the PR makes the following changes/additions:

1. Merges the cudnn algo registry's separate Find() and Register() methods into a single FindOrElseRegister() call protected by a mutex. Before, there was a window of time between when a thread found no algo entry and when it registered a new entry.
This window could allow a different thread to call cudnnFind() on the same convolution type and get a different answer. FindOrElseRegister() takes a callback that determines the algo via cudnnFind() and makes access to the cudnn algo registry effectively atomic. The callbacks provided with cudnn_convolution and cudnn_deconvolution also grab the storage allocator's mutex to quiet GPU activity from other threads while cudnnFind() is running.
2. Changes MXNet's storage allocator to round large allocations up to a multiple of 2MB. This value ensures that small permanent allocations that might have shared a cuda memory allocation slab with a prior large allocation don't prevent the reclaiming of that slab area.
3. Uses the MXNet storage allocator's Alloc() and DirectFree() API to coax a convolution's I/O storage needs away from MXNet's storage allocator prior to calling cudnnFind().

The first commit introduces a unittest that demonstrates issue #12662 in the current codebase. To make the test adaptable to GPUs of varying global memory, a C-level API was brought out to the python API by introducing mx.context.gpu_memory_info(dev_id). Follow-up commits supply the fix and the rest of the PR.

Interested parties: @KellenSunderland @piiswrong @eric-haibin-lin @zheng-da

## Checklist ##

### Essentials ###

Please feel free to remove inapplicable items for your PR.
- [ ] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) created (except PRs with tiny changes)
- [x] Changes are complete (i.e. I finished coding on this PR)
- [x] All changes have test coverage:
  - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
- [x] Code is well-documented:
  - For user-facing API changes, API doc string has been updated.
  - For new C++ functions in header files, their functionalities and arguments are documented.
  - For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  - Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
- [x] To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

### Changes ###

- [ ] Feature1, tests, (and when applicable, API doc)
- [ ] Feature2, tests, (and when applicable, API doc)

## Comments ##

The following command was run on an 8-GPU system to demonstrate the new consistency of global memory footprint:
```
python /opt/mxnet/example/image-classification/train_imagenet.py --gpu 0,1,2,3,4,5,6,7 --batch-size 256 --num-epochs 1 --data-train /data/iman-val-recordio-352/train.rec --disp-batches 10 --network inception-v3 --data-nthreads 32 --image-shape 3,299,299
```
Two nvidia-smi outputs are shown. The first is before the PR (showing footprints in the range of 5GB to 8GB), while the second is after (showing consistent 5GB footprints with < 1% variation).

Before PR:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.45                 Driver Version: 410.45                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   58C    P0   247W / 300W |   6312MiB / 32480MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   60C    P0   274W / 300W |   5542MiB / 32480MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   63C    P0   255W / 300W |   5152MiB / 32480MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   54C    P0   258W / 300W |   6730MiB / 32480MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   57C    P0   222W / 300W |   5158MiB / 32480MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   59C    P0   260W / 300W |   5168MiB / 32480MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   63C    P0   266W / 300W |   5160MiB / 32480MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   58C    P0   255W / 300W |   8252MiB / 32480MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53319      C   python                                      6289MiB |
|    1     53319      C   python                                      5519MiB |
|    2     53319      C   python                                      5129MiB |
|    3     53319      C   python                                      6707MiB |
|    4     53319      C   python                                      5135MiB |
|    5     53319      C   python                                      5145MiB |
|    6     53319      C   python                                      5137MiB |
|    7     53319      C   python                                      8229MiB |
+-----------------------------------------------------------------------------+
```

After PR:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.45                 Driver Version: 410.45                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   55C    P0   271W / 300W |   5044MiB / 32480MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   57C    P0   221W / 300W |   5042MiB / 32480MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   59C    P0   214W / 300W |   5042MiB / 32480MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   52C    P0   263W / 300W |   5040MiB / 32480MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   55C    P0   252W / 300W |   5048MiB / 32480MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   56C    P0   180W / 300W |   5056MiB / 32480MiB |     81%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   59C    P0   244W / 300W |   5060MiB / 32480MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   54C    P0   255W / 300W |   5084MiB / 32480MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     47478      C   python                                      5021MiB |
|    1     47478      C   python                                      5019MiB |
|    2     47478      C   python                                      5019MiB |
|    3     47478      C   python                                      5017MiB |
|    4     47478      C   python                                      5025MiB |
|    5     47478      C   python                                      5033MiB |
|    6     47478      C   python                                      5037MiB |
|    7     47478      C   python                                      5061MiB |
+-----------------------------------------------------------------------------+
```
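The merged FindOrElseRegister() behavior described in change 1 can be sketched in Python (names are illustrative; MXNet's actual registry is C++). Holding one lock across both the lookup and the registration closes the window in which two threads could each run the expensive search and cache different answers:

```python
import threading

class AlgoRegistry:
    """Toy sketch of a find-or-else-register algo cache (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._registry = {}

    def find_or_else_register(self, key, search_fn):
        # Lookup and registration happen under the same lock, so access
        # to the registry is effectively atomic.
        with self._lock:
            if key not in self._registry:
                # Only the first thread to reach this key runs the search
                # (standing in for cudnnFind()); later threads see its result.
                self._registry[key] = search_fn(key)
            return self._registry[key]

registry = AlgoRegistry()
calls = []

def fake_search(key):
    calls.append(key)                 # record how many searches actually ran
    return "algo_for_%s" % (key,)

threads = [threading.Thread(target=registry.find_or_else_register,
                            args=(("conv", 3, 64), fake_search))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite four concurrent lookups of the same key, the search ran once.
```

Grabbing the storage allocator's mutex inside the callback (as the PR's cudnn_convolution/cudnn_deconvolution callbacks do) would extend the same idea: quiet other threads' GPU activity for the duration of the timing run.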
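Change 2's rounding of large allocations to a multiple of 2MB is the usual round-up-to-multiple arithmetic. A minimal sketch (the "large allocation" threshold and the names here are illustrative, not MXNet's exact code):

```python
MB = 1 << 20
ROUND_TO = 2 * MB            # slab granularity per this PR
LARGE_ALLOC_THRESHOLD = MB   # illustrative cutoff for a "large" allocation

def rounded_alloc_size(nbytes):
    """Round large requests up to a multiple of 2MB, so a later small
    long-lived allocation cannot share (and thus pin) the tail of a slab
    that a freed large allocation should return whole."""
    if nbytes < LARGE_ALLOC_THRESHOLD:
        return nbytes
    return ((nbytes + ROUND_TO - 1) // ROUND_TO) * ROUND_TO
```

For example, a request of 5MB + 1 byte is rounded up to 6MB, i.e. three 2MB slabs, while small requests pass through unchanged.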
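Change 3's Alloc()/DirectFree() dance can be illustrated with a toy model of a caching allocator (this is a deliberately simplified sketch, not MXNet's StorageManager). The point: cudnnFind() allocates via cudaMalloc, bypassing the pool's cache, so bytes sitting in the cache don't help it; pulling the conv's I/O sizes out of the pool and DirectFree()-ing them returns those bytes to the driver first:

```python
class ToyPool:
    """Toy caching allocator with a DirectFree escape hatch (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity  # bytes the "driver" can still hand out
        self.cached = 0           # freed bytes the pool keeps for reuse

    def alloc(self, nbytes):
        reuse = min(self.cached, nbytes)   # reuse cached bytes first
        self.cached -= reuse
        fresh = nbytes - reuse
        if fresh > self.capacity:
            raise MemoryError("out of device memory")
        self.capacity -= fresh             # then take fresh driver bytes
        return nbytes

    def free(self, nbytes):
        self.cached += nbytes     # normal free: keep in the pool's cache

    def direct_free(self, nbytes):
        self.capacity += nbytes   # DirectFree: return to the driver

def cuda_malloc(pool, nbytes):
    """cudnnFind()'s allocations bypass the pool cache entirely."""
    if nbytes > pool.capacity:
        raise MemoryError("cudaMalloc failed")
    pool.capacity -= nbytes
    return nbytes

pool = ToyPool(capacity=100)
buf = pool.alloc(60)
pool.free(buf)              # warm-up leaves 60 bytes cached in the pool;
                            # cuda_malloc(pool, 60) would now fail (issue
                            # #12662's shape), since only 40 driver bytes remain
io = pool.alloc(60)         # pull those bytes back out of the cache...
pool.direct_free(io)        # ...and hand them to the driver, not the cache
ws = cuda_malloc(pool, 60)  # cudnnFind()'s own cudaMalloc now succeeds
```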
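The unittest's adaptation to GPUs of varying global memory can be sketched as follows. The helper below is purely illustrative (its name and the 0.8 fraction are assumptions, not the PR's test code); the free/total figures would come from the new mx.context.gpu_memory_info(dev_id) API:

```python
def side_for_budget(free_bytes, frac=0.8, dtype_bytes=4):
    """Largest square float32 tensor side fitting in a fraction of free
    GPU memory (illustrative helper for a memory-adaptive test)."""
    budget_elems = int(free_bytes * frac) // dtype_bytes
    return int(budget_elems ** 0.5)

# In the actual test one would query the device, e.g. (assumed usage):
#   free, total = mx.context.gpu_memory_info(0)
#   side = side_for_budget(free)
```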
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
