DickJC123 opened a new pull request, #21182:
URL: https://github.com/apache/mxnet/pull/21182

   ## Description ##
   This provides an updated fix for issue 
https://github.com/apache/mxnet/issues/19379.  To understand this fix, a bit of 
history is needed:
   
   - At some point (possibly with the arrival of Ubuntu 20.04), the destruction 
order at program exit of the CUDA context and MXNet's singleton engine became 
non-deterministic.  When the engine was destroyed after the CUDA context, a 
segfault would occur because CUDA resources were being released to a 
non-existent context.
   - @ptrendx supplied the fix of not destroying MXNet's Stream objects at exit 
in master PR https://github.com/apache/mxnet/pull/19378, which was also 
back-ported to v1.x.
   - Since that time, improvements to CUDA have removed its susceptibility to 
the problem, possibly starting with CUDA 11.2.  CUDA 10.2 and 11.0 are 
confirmed to be susceptible.
   - A different issue, https://github.com/apache/mxnet/issues/20959, was found 
to be due to the lack of CUDA resource cleanup at exit.  As a result, 
@ptrendx's PR was reverted (on the v1.9.x branch) in 
https://github.com/apache/mxnet/pull/20998.  Due to the improvements in CUDA, 
most users did not experience the return of the original segfault-at-exit 
problem.
   - But clearly, for users still on CUDA 10.2 and 11.0, the segfault behavior 
has returned (see recent posts to issue 
https://github.com/apache/mxnet/issues/19379).
   
   This PR reinstates the fix of not destroying MXNet's Stream objects, but 
applies it only when the main Python process is exiting (as detected 
by `shutdown_phase_ == true`). The destruction of Streams in the dataloader 
side-processes should not be affected, so the memory leak tracked in 
https://github.com/apache/mxnet/issues/20959 should not resurface.
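
   The conditional cleanup described above can be sketched roughly as follows. 
This is a minimal illustration, not the code in this PR: `FakeStream`, 
`StreamOwner`, and `NotifyShutdown` are hypothetical names, and only the 
`shutdown_phase_` flag comes from the description. A live-instance counter 
stands in for real CUDA stream resources.

```cpp
#include <atomic>
#include <vector>

// Hypothetical stand-in for a GPU stream. It counts live instances so the
// effect of skipping destruction can be observed without CUDA.
struct FakeStream {
  static int live;
  FakeStream() { ++live; }
  ~FakeStream() { --live; }
};
int FakeStream::live = 0;

// Hypothetical owner of streams, loosely modeled on the behavior described
// in this PR; it is not MXNet's actual engine class.
class StreamOwner {
 public:
  explicit StreamOwner(int n) {
    for (int i = 0; i < n; ++i) streams_.push_back(new FakeStream());
  }

  // Assumed hook: called once the main process has begun exiting.
  void NotifyShutdown() { shutdown_phase_.store(true); }

  ~StreamOwner() {
    if (shutdown_phase_.load()) {
      // Process exit: the CUDA context may already be gone, so the streams
      // are deliberately left alone rather than released into it.
      return;
    }
    // Normal teardown (e.g. a worker process): release every stream.
    for (FakeStream* s : streams_) delete s;
  }

 private:
  std::vector<FakeStream*> streams_;
  std::atomic<bool> shutdown_phase_{false};
};
```

   In this sketch, an owner destroyed without a shutdown notification frees 
all of its streams, while one destroyed after `NotifyShutdown()` leaves them 
for the operating system to reclaim at process exit.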
   
   ## Checklist ##
   ### Essentials ###
   - [X] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], 
[FEATURE], [DOC], etc)
   - [X] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage
   - [X] Code is well-documented
   
   

