ptrendx opened a new pull request #14550: Do not touch GPU 0 during ReleaseAll
URL: https://github.com/apache/incubator-mxnet/pull/14550

## Description ##
Currently, during a call to `ReleaseAll` (either in the pooled allocator's destructor or when memory is full and the allocator needs to release cached allocations), the memory handle is not fully populated: only the pointer and size are set, not the context information. This causes the `DirectFreeNoLock` method to switch to GPU 0, since 0 is the `dev_id` of a default-constructed `Context`. In multi-process training (e.g. with Horovod) this can produce out-of-memory errors due to context creation on GPU 0, or it can fail outright and crash the application if exclusive mode is set on GPU 0.

This PR fixes that by adding a `dev_id_` parameter to the pooled storage managers and populating the handle with a properly constructed `Context` in the `ReleaseAll` method.

@apeforest @yuxihu FYI

## Checklist ##
### Essentials ###
Please feel free to remove inapplicable items for your PR.
- [x] Changes are complete (i.e. I finished coding on this PR)
- [x] Code is well-documented:
  - For new C++ functions in header files, their functionalities and arguments are documented.
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
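
For readers unfamiliar with the allocator, here is a minimal sketch of the behavior described in the Description, not the actual MXNet code. `Context`, `Handle`, and `PooledManagerSketch` are simplified, hypothetical stand-ins, and instead of switching CUDA devices, `DirectFreeNoLock` only records the device id it would have switched to. It assumes one manager per device, with the device id passed in at construction.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the real MXNet types.
struct Context {
  int dev_id = 0;  // a default-constructed Context refers to device 0
};

struct Handle {
  void* dptr = nullptr;
  std::size_t size = 0;
  Context ctx;
};

class PooledManagerSketch {
 public:
  // dev_id is the device this manager serves (stored as dev_id_).
  explicit PooledManagerSketch(int dev_id) : dev_id_(dev_id) {}

  // Put a freed allocation into the cache, bucketed by size.
  void Cache(void* ptr, std::size_t size) {
    assert(ptr != nullptr);
    pool_[size].push_back(ptr);
  }

  // Behavior before the fix: the handle carries only pointer and size.
  void ReleaseAllUnpopulated() { ReleaseAllImpl(false); }

  // Behavior after the fix: the handle also carries this manager's device.
  void ReleaseAll() { ReleaseAllImpl(true); }

  // Device ids that frees were issued on, in order.
  const std::vector<int>& touched_devices() const { return touched_; }

 private:
  // Stand-in for the real free: records the device it would switch to
  // instead of calling into CUDA.
  void DirectFreeNoLock(const Handle& handle) {
    touched_.push_back(handle.ctx.dev_id);
  }

  void ReleaseAllImpl(bool populate_ctx) {
    for (const auto& bucket : pool_) {
      for (void* ptr : bucket.second) {
        Handle handle;
        handle.dptr = ptr;
        handle.size = bucket.first;
        if (populate_ctx) {
          handle.ctx.dev_id = dev_id_;  // the fix: use the manager's device
        }
        DirectFreeNoLock(handle);
      }
    }
    pool_.clear();
  }

  int dev_id_;
  std::unordered_map<std::size_t, std::vector<void*>> pool_;
  std::vector<int> touched_;
};
```

In this sketch, a manager constructed for device 3 records device 0 for every free issued through `ReleaseAllUnpopulated`, and device 3 for every free issued through `ReleaseAll`. That mirrors why a process that never otherwise uses GPU 0 could still end up creating a context there.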
