DickJC123 opened a new pull request #12804: CudnnFind() usage improvements
URL: https://github.com/apache/incubator-mxnet/pull/12804
 
 
   ## Description ##
   
   This PR improves MXNet's use of cudnnFind() to address a few issues:
       1. With the gluon imperative style, cudnnFind() is called during forward(), and so might have its timings perturbed by other GPU activity (including, potentially, other cudnnFind() calls).
       2. With some CUDA driver versions, care is needed to ensure that the large I/O and workspace cudaMalloc()s performed by cudnnFind() are immediately released and made available to MXNet.
       3. cudnnFind() makes both conv I/O and workspace allocations that must be covered by the GPU global memory headroom defined by MXNET_GPU_MEM_POOL_RESERVE.  Per issue #12662, large convolutions can result in out-of-memory errors, even when MXNet's storage allocator has free memory in its pool.
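   To make issue 3 concrete, here is a back-of-the-envelope sketch of the headroom arithmetic.  All sizes below are hypothetical, and MXNET_GPU_MEM_POOL_RESERVE is taken as a percentage of total GPU memory that MXNet's pool must leave unallocated:

   ```python
   # Illustrative arithmetic only; sizes are hypothetical stand-ins.
   # MXNET_GPU_MEM_POOL_RESERVE is treated as the percent of total GPU
   # memory that MXNet's storage pool leaves free for outside allocations.
   total_mem = 32 * 1024**3          # 32 GiB GPU
   reserve_pct = 5                   # MXNET_GPU_MEM_POOL_RESERVE default
   headroom = total_mem * reserve_pct // 100

   # cudnnFind() allocates the conv I/O tensors plus the candidate algos'
   # workspaces outside the pool, so both must fit within the headroom:
   conv_io_bytes = 900 * 1024**2     # hypothetical large conv I/O
   workspace_bytes = 800 * 1024**2   # hypothetical algo workspace
   fits = conv_io_bytes + workspace_bytes <= headroom
   ```

   With these stand-in numbers the combined allocation exceeds the headroom, which is the shape of the out-of-memory failure reported in issue #12662.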
   
   This PR addresses these issues, providing the following benefits:
       1. Consistent algo choice for a given convolution type in a model, both for instances on the same GPU and across GPUs in a multi-GPU training setting.
       2. Consistent algo choice from run to run, based on eliminating sources of interference with the cudnnFind() timing process.
       3. Consistent model global memory footprint, both because of the consistent algo choice (algos can have markedly different workspace requirements) and because of changes to MXNet's use of cudaMalloc().
       4. Increased training performance, based on being able to consistently run models that approach the GPU's full global memory footprint.
       5. A unittest for, and a fix of, issue #12662.
   
   Specifically, the PR makes the following changes/additions:
       1. Merges the cudnn algo registry's separate Find() and Register() methods into a single FindOrElseRegister() call protected by a mutex.  Before, there was a window of time between when a thread found no algo entry and when it registered a new entry; this window could allow a different thread to call cudnnFind() on the same convolution type and get a different answer.  FindOrElseRegister() takes a callback that determines the algo via cudnnFind(), and makes access to the cudnn algo registry effectively atomic.  The callbacks provided with cudnn_convolution and cudnn_deconvolution also grab the storage allocator's mutex to quiet GPU activity from other threads while cudnnFind() runs.
       2. Changes MXNet's storage allocator to round large allocations up to a multiple of 2MB.  This ensures that small permanent allocations that might have shared a cuda memory slab with a prior large allocation don't prevent that slab area from being reclaimed.
       3. Uses the MXNet storage allocator's Alloc() and DirectFree() APIs to coax a convolution's I/O storage needs away from MXNet's storage allocator pool prior to calling cudnnFind().
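   The first two changes above can be sketched as follows.  This is a Python-shaped illustration of the pattern, not MXNet's actual C++ implementation; all names here are invented for the sketch:

   ```python
   # Sketch of (1) an atomic find-or-else-register algo registry and
   # (2) rounding large allocations to a 2MB multiple. Illustrative only.
   import threading

   class AlgoRegistry:
       """Maps a convolution signature to a chosen algo, atomically."""
       def __init__(self):
           self._lock = threading.Lock()
           self._algos = {}

       def find_or_else_register(self, key, find_algo):
           # Holding the lock across both the lookup and the (possible)
           # registration closes the window in which two threads could each
           # run the expensive find step and record different answers.
           with self._lock:
               if key not in self._algos:
                   self._algos[key] = find_algo(key)
               return self._algos[key]

   ROUND_BYTES = 2 * 1024 * 1024  # 2MB slab granularity

   def round_alloc(nbytes):
       # Round allocations up to a 2MB multiple so a small permanent
       # allocation can't land in the tail of a slab backing a prior large
       # allocation and pin the whole slab against reclaiming.
       return ((nbytes + ROUND_BYTES - 1) // ROUND_BYTES) * ROUND_BYTES
   ```

   In the real code the callback runs cudnnFind() (and takes the storage allocator's mutex to quiet other GPU activity); here it is just any function of the key, invoked at most once per convolution signature.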
   
   The first commit introduces a unittest that demonstrates issue #12662 in the current codebase.  To make the test adaptable to GPUs of varying global memory, a C-level API was exposed to Python by introducing mx.context.gpu_memory_info(dev_id).  Follow-up commits supply the fix and the rest of the PR.
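   Assuming the new call returns a (free, total) tuple of bytes for the device, the test's adaptive sizing can be sketched like this (the tuple values below are stand-ins rather than a real device query):

   ```python
   # Stand-in for: free, total = mx.context.gpu_memory_info(dev_id),
   # assumed here to return (free_bytes, total_bytes). Values hypothetical.
   free, total = 28 * 1024**3, 32 * 1024**3  # e.g. a 32 GiB GPU

   # Size a float32 test tensor to a fixed fraction of free memory so the
   # issue #12662 reproduction adapts to GPUs of varying capacity.
   fraction = 0.8
   elem_bytes = 4
   n_elems = int(free * fraction) // elem_bytes
   ```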
   
   Hopefully interested parties: @KellenSunderland @piiswrong @eric-haibin-lin @zheng-da
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [ ] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to 
the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) 
created (except PRs with tiny changes)
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding 
a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing 
distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a 
new build option with NCCL)
   - [x] Code is well-documented:
   - For user-facing API changes, API doc string has been updated.
   - For new C++ functions in header files, their functionalities and arguments 
are documented. 
   - For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
   - Check the API doc at 
http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
   - [x] To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] Mutex-protected FindOrElseRegister() in the cudnn algo registry, with quieting of other GPU activity during cudnnFind() timing
   - [x] Storage allocator rounding of large allocations to a 2MB multiple, and Alloc()/DirectFree() use around cudnnFind()
   - [x] mx.context.gpu_memory_info(dev_id) Python API and a unittest for issue #12662
   
   ## Comments ##
   
   The following command was run on an 8-GPU system to demonstrate the new 
consistency of global memory footprint:
   ```
   python /opt/mxnet/example/image-classification/train_imagenet.py --gpu 
0,1,2,3,4,5,6,7 --batch-size 256 --num-epochs 1 --data-train 
/data/iman-val-recordio-352/train.rec --disp-batches 10 --network inception-v3 
--data-nthreads 32 --image-shape 3,299,299
   ```
   Two nvidia-smi outputs are shown.  The first is before the PR (showing 
footprints in the range of 5GB to 8GB), while the second is after (showing 
consistent 5GB footprints with < 1% variation).
   
   Before PR:
    ```
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.45                 Driver Version: 410.45                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
    | N/A   58C    P0   247W / 300W |   6312MiB / 32480MiB |     86%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
    | N/A   60C    P0   274W / 300W |   5542MiB / 32480MiB |     86%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
    | N/A   63C    P0   255W / 300W |   5152MiB / 32480MiB |     89%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
    | N/A   54C    P0   258W / 300W |   6730MiB / 32480MiB |     88%      Default |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
    | N/A   57C    P0   222W / 300W |   5158MiB / 32480MiB |     87%      Default |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
    | N/A   59C    P0   260W / 300W |   5168MiB / 32480MiB |     84%      Default |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
    | N/A   63C    P0   266W / 300W |   5160MiB / 32480MiB |     88%      Default |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
    | N/A   58C    P0   255W / 300W |   8252MiB / 32480MiB |     90%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0     53319      C   python                                      6289MiB |
    |    1     53319      C   python                                      5519MiB |
    |    2     53319      C   python                                      5129MiB |
    |    3     53319      C   python                                      6707MiB |
    |    4     53319      C   python                                      5135MiB |
    |    5     53319      C   python                                      5145MiB |
    |    6     53319      C   python                                      5137MiB |
    |    7     53319      C   python                                      8229MiB |
    +-----------------------------------------------------------------------------+
    ```
   After PR:
    ```
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.45                 Driver Version: 410.45                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
    | N/A   55C    P0   271W / 300W |   5044MiB / 32480MiB |     85%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
    | N/A   57C    P0   221W / 300W |   5042MiB / 32480MiB |     87%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
    | N/A   59C    P0   214W / 300W |   5042MiB / 32480MiB |     84%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
    | N/A   52C    P0   263W / 300W |   5040MiB / 32480MiB |     82%      Default |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
    | N/A   55C    P0   252W / 300W |   5048MiB / 32480MiB |     82%      Default |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
    | N/A   56C    P0   180W / 300W |   5056MiB / 32480MiB |     81%      Default |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
    | N/A   59C    P0   244W / 300W |   5060MiB / 32480MiB |     79%      Default |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
    | N/A   54C    P0   255W / 300W |   5084MiB / 32480MiB |     83%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0     47478      C   python                                      5021MiB |
    |    1     47478      C   python                                      5019MiB |
    |    2     47478      C   python                                      5019MiB |
    |    3     47478      C   python                                      5017MiB |
    |    4     47478      C   python                                      5025MiB |
    |    5     47478      C   python                                      5033MiB |
    |    6     47478      C   python                                      5037MiB |
    |    7     47478      C   python                                      5061MiB |
    +-----------------------------------------------------------------------------+
    ```
