ptrendx opened a new pull request #14550: Do not touch GPU 0 during ReleaseAll
URL: https://github.com/apache/incubator-mxnet/pull/14550
 
 
   ## Description ##
   Currently, when ReleaseAll is called (either in the pooled allocator's 
destructor or when memory is full and the allocator needs to release cached 
allocations), the memory handle is not fully populated: only ptr and size are 
set, not the context information. As a result, the DirectFreeNoLock method 
switches to GPU 0 (since 0 is the dev_id of a default-constructed Context). In 
multi-process training (e.g. with Horovod) this can produce out-of-memory 
errors due to context creation on GPU 0, or it can fail outright and crash the 
application if GPU 0 is set to exclusive mode.
   
   This PR fixes that by adding a `dev_id_` parameter to the pooled storage 
managers and populating the handle with a properly constructed Context in the 
ReleaseAll method.
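
A minimal, self-contained sketch of the idea described above, not the actual diff. `Context`, `Handle`, and `PooledManagerSketch` are simplified, hypothetical stand-ins for the MXNet types, and `DirectFreeNoLock` here only records which device it would switch to instead of calling into CUDA:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for MXNet's Context.
struct Context {
  enum DeviceType { kCPU, kGPU };
  DeviceType dev_type = kCPU;
  int dev_id = 0;  // a default-constructed Context refers to device 0

  static Context GPU(int id) {
    Context ctx;
    ctx.dev_type = kGPU;
    ctx.dev_id = id;
    return ctx;
  }
};

// Hypothetical, simplified stand-in for a storage handle.
struct Handle {
  void* dptr = nullptr;
  std::size_t size = 0;
  Context ctx;  // left default-constructed unless explicitly populated
};

// Illustrative pooled manager that remembers the device it serves.
class PooledManagerSketch {
 public:
  explicit PooledManagerSketch(int dev_id) : dev_id_(dev_id) {
    assert(dev_id >= 0);
  }

  // Put a freed block into the cache, bucketed by size.
  void Cache(void* ptr, std::size_t size) { pool_[size].push_back(ptr); }

  // Release every cached block. The handle gets a Context built from
  // dev_id_, so the free path never falls back to device 0.
  void ReleaseAll() {
    for (auto& bucket : pool_) {
      for (void* ptr : bucket.second) {
        Handle handle;
        handle.dptr = ptr;
        handle.size = bucket.first;
        handle.ctx = Context::GPU(dev_id_);
        DirectFreeNoLock(handle);
      }
    }
    pool_.clear();
  }

  // Devices that the free path would have switched to, in call order.
  const std::vector<int>& touched_devices() const { return touched_; }

 private:
  // Real code would select the device and free the memory; this sketch
  // only records the device id taken from the handle's context.
  void DirectFreeNoLock(const Handle& handle) {
    touched_.push_back(handle.ctx.dev_id);
  }

  int dev_id_;
  std::unordered_map<std::size_t, std::vector<void*>> pool_;
  std::vector<int> touched_;
};
```

If the `handle.ctx` assignment in `ReleaseAll` were omitted, every entry recorded by `DirectFreeNoLock` would be 0 regardless of which device the manager was created for, which is the behavior this PR removes.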
   
   @apeforest @yuxihu FYI
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] Code is well-documented: 
   - For new C++ functions in header files, their functionalities and arguments 
are documented. 
   - [x] To the best of my knowledge, examples are either not affected by this 
change, or have been fixed to be compatible with this change
   

