DickJC123 opened a new pull request, #21104:
URL: https://github.com/apache/incubator-mxnet/pull/21104

   ## Description ##
   PR #20635, which began using the cuDNN v8 backend API for Convolution ops, 
includes the following line to avoid `test_gluon_gpu.py::test_group_conv2d_16c` 
failures that began with that PR:
   
   ```
   @with_environment('MXNET_CUDNN_DISABLED_CONV_FWD_ENGINES', '5')  # eng:5 causes test failure on M60
   ```
   This PR removes that line by providing a different implementation of the 
"convolution plan cache" introduced with PR #20635, one that is compatible with 
convolution engine 5.  The steps of this PR are:
   
   1. Re-enable convolution engine 5 and demonstrate the return of 
`test_group_conv2d` failures, then
   2. Upgrade the convolution plan cache so that `test_group_conv2d` 
passes even when engine 5 is used.
   
   Further detail:
   
   The cuDNN v8 backend allows one to bypass much of the CPU processing that 
might precede kernel launch by first building up and finalizing a convolution 
execution plan.  The plan is then executed efficiently by the call 
`cudnnBackendExecute(cudnn_handle, plan, ...)`.  PR #20635 introduced a cache 
of plans so that autotuning does not need to be repeated for 
identically-parameterized convolutions, and so that such convolutions are 
handled by the same engine even if they appear multiple times in a model or are 
handled by different GPU workers.
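
   The pre-existing cache can be sketched roughly as follows (an illustrative 
sketch only, with hypothetical names and a stand-in struct in place of the real 
`cudnnBackendDescriptor_t` plan, not MXNet's actual code):

   ```cpp
   #include <cassert>
   #include <memory>
   #include <string>
   #include <unordered_map>

   // Stand-in for a finalized cudnnBackendDescriptor_t execution plan.
   struct ConvPlan {
     std::string params;  // serialized conv parameters (shapes, strides, dtype, ...)
     int engine_id;       // engine selected by autotuning
   };

   class PlanCache {
    public:
     // Look up a plan for the given parameters; autotune only on a miss.
     std::shared_ptr<ConvPlan> GetPlan(const std::string& params) {
       auto it = cache_.find(params);
       if (it != cache_.end()) return it->second;  // hit: reuse the finalized plan
       auto plan = Autotune(params);               // miss: benchmark candidate engines
       cache_.emplace(params, plan);
       return plan;
     }
     int autotune_calls = 0;

    private:
     std::shared_ptr<ConvPlan> Autotune(const std::string& params) {
       ++autotune_calls;  // expensive in reality; here just counted
       return std::make_shared<ConvPlan>(ConvPlan{params, /*engine_id=*/5});
     }
     std::unordered_map<std::string, std::shared_ptr<ConvPlan>> cache_;
   };

   int main() {
     PlanCache cache;
     auto a = cache.GetPlan("3x3,stride=1,fp16");
     auto b = cache.GetPlan("3x3,stride=1,fp16");  // identical params: same plan
     assert(a == b);
     assert(cache.autotune_calls == 1);
     return 0;
   }
   ```

   The key point is that the cache key is the convolution's parameterization 
alone, so every worker that asks for the same convolution gets the same plan 
(and hence the same engine).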
   
   The issue discovered for convolution engine 5 is that it caches the cuDNN 
handle provided during the plan's construction and ignores the handle passed as 
an argument to `cudnnBackendExecute()`.  As a result, the engine's kernels are 
launched into the stream of the cached handle, which is the wrong stream 
whenever the GPU worker executing the plan is not the one that created it.  
Without the proper stream synchronization, incorrect results may follow.
   
   The contribution of this PR is to effectively include a GPU worker's cuDNN 
handle as part of the key used in the cache lookup.  A further refinement of 
the fix: on a cache miss, an entry made by a different worker can be 'cloned' 
with the proper handle, without repeating the autotuning.
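
   The fix can be sketched like so (again an illustrative sketch with 
hypothetical names and an `int` stand-in for `cudnnHandle_t`, not the PR's 
actual code): the key becomes (parameters, handle), and a miss on one handle 
first looks for the same parameters under any other handle before falling back 
to autotuning.

   ```cpp
   #include <cassert>
   #include <map>
   #include <memory>
   #include <string>
   #include <utility>

   using CudnnHandle = int;  // stand-in for cudnnHandle_t

   struct ConvPlan {
     std::string params;
     CudnnHandle handle;  // engine 5 launches into this handle's stream
     int engine_id;
   };

   class PlanCache {
    public:
     std::shared_ptr<ConvPlan> GetPlan(const std::string& params, CudnnHandle h) {
       auto it = cache_.find({params, h});
       if (it != cache_.end()) return it->second;  // exact hit: right handle already
       // Miss for this handle: try to clone an entry with the same parameters
       // that another worker has already autotuned.
       for (const auto& kv : cache_) {
         if (kv.first.first == params) {
           auto plan = std::make_shared<ConvPlan>(*kv.second);
           plan->handle = h;  // rebuild the plan with the caller's handle
           cache_.emplace(std::make_pair(params, h), plan);
           return plan;       // no autotuning repeated
         }
       }
       ++autotune_calls;      // true miss: autotune once for these parameters
       auto plan = std::make_shared<ConvPlan>(ConvPlan{params, h, /*engine_id=*/5});
       cache_.emplace(std::make_pair(params, h), plan);
       return plan;
     }
     int autotune_calls = 0;

    private:
     std::map<std::pair<std::string, CudnnHandle>, std::shared_ptr<ConvPlan>> cache_;
   };

   int main() {
     PlanCache cache;
     auto p1 = cache.GetPlan("3x3,fp16", /*worker 0's handle*/ 100);  // autotunes
     auto p2 = cache.GetPlan("3x3,fp16", /*worker 1's handle*/ 200);  // cloned
     assert(cache.autotune_calls == 1);          // autotuning ran only once
     assert(p1->handle == 100 && p2->handle == 200);
     assert(p1->engine_id == p2->engine_id);     // same engine for both workers
     return 0;
   }
   ```

   This preserves both goals of the original cache (no repeated autotuning, one 
engine per parameterization) while guaranteeing each worker executes a plan 
built with its own handle, so kernels land in the correct stream.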
   
   ## Checklist ##
   ### Essentials ###
   - [X] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], 
[FEATURE], [DOC], etc)
   - [ ] Changes are complete (i.e. I finished coding on this PR)
   - [X] All changes have test coverage
   - [X] Code is well-documented
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
