DickJC123 opened a new pull request, #21182: URL: https://github.com/apache/mxnet/pull/21182
## Description ##
This provides an updated fix for issue https://github.com/apache/mxnet/issues/19379. To understand this fix, a bit of history is needed:
- At some point (possibly the arrival of Ubuntu 20.04) the destruction order at program exit of the CUDA context and MXNet's singleton engine became non-deterministic. When the engine destruction occurred after the CUDA context destruction, a segfault would occur due to the release of CUDA resources to a non-existent context.
- @ptrendx supplied the fix of not destroying MXNet's Stream objects at exit in master PR https://github.com/apache/mxnet/pull/19378, which also was back-ported to v1.x.
- Since that time, improvements to CUDA have made it no longer susceptible to the problem, starting possibly with CUDA 11.2. CUDA 10.2 and 11.0 are confirmed to be susceptible.
- A different issue https://github.com/apache/mxnet/issues/20959 was found to be due to the lack of CUDA resource cleanup at exit. As a result, @ptrendx's PR was reverted (on the v1.9.x branch) here https://github.com/apache/mxnet/pull/20998. Due to the improvements in CUDA, most users did not experience the return of the original segfault-at-exit problem.
- But clearly, for users still on CUDA 10.2 and 11.0, the segfault behavior has returned (see recent posts to issue https://github.com/apache/mxnet/issues/19379).

This PR resupplies the fix of not destroying MXNet's Stream objects, but applies this remedy only when the main Python process is exiting (as detected by `shutdown_phase_ == true`). The destruction of Streams in the dataloader side-processes should not be affected, and so the data memory leak should not resurface.

## Checklist ##
### Essentials ###
- [X] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
- [X] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage
- [X] Code is well-documented

--
This is an automated message from the Apache Git Service.
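The shutdown-only remedy from the Description can be illustrated with a minimal, self-contained sketch. This is not MXNet's actual code: `FakeStream`, `StreamOwner`, `NotifyShutdown`, and `ReleasedAfterDestroy` are hypothetical names invented for illustration, and only the `shutdown_phase_` flag name comes from the PR text. The sketch assumes the owner's destructor is where streams would normally be released, and shows that release being skipped only when the shutdown flag is set.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a GPU stream. The counter records how many
// streams had their resources released (i.e. were actually destroyed).
struct FakeStream {
  static int released;
  ~FakeStream() { ++released; }
};
int FakeStream::released = 0;

// Hypothetical stand-in for the engine component that owns the streams.
class StreamOwner {
 public:
  explicit StreamOwner(int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) streams_.push_back(new FakeStream());
  }

  // Assumed hook: called once the main process has started exiting.
  void NotifyShutdown() { shutdown_phase_.store(true); }

  ~StreamOwner() {
    if (shutdown_phase_.load()) {
      // Main-process exit: deliberately skip the release, because the
      // underlying driver context may already have been torn down.
      return;
    }
    // Normal path (e.g. a worker process): release every stream.
    for (FakeStream* s : streams_) delete s;
  }

 private:
  std::atomic<bool> shutdown_phase_{false};
  std::vector<FakeStream*> streams_;
};

// Creates and destroys an owner of n streams, returning how many streams
// were released by that destruction.
int ReleasedAfterDestroy(int n, bool shutting_down) {
  const int before = FakeStream::released;
  {
    StreamOwner owner(n);
    if (shutting_down) owner.NotifyShutdown();
  }
  return FakeStream::released - before;
}
```

With this sketch, `ReleasedAfterDestroy(3, false)` releases all three streams, while `ReleasedAfterDestroy(3, true)` releases none. The intentional leak in the shutdown path is harmless here only on the assumption that the process is about to exit and the operating system will reclaim its resources.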
