## Problem statement
I noticed that the memory pool keeps memory allocated to the MXNet process so 
that NDArrays/tensors can be allocated faster from the pool. At times the pool 
grows very large, and memory may not be released back to the pool immediately 
once an NDArray goes out of scope. When I ran the large tensor nightly tests 
(all of them sequentially, in one process), certain tests caused an OOM even 
on a 720 GB RAM machine (p2.16xl), even though each of them individually used 
less than 50 GB. When I added a LOG(INFO) statement to print how many bytes 
MXNet was requesting, it was asking for roughly 7500-8500 GB.
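
To make the retention easier to see outside the nightly tests, here is a 
minimal reproduction sketch of my own (not part of the test suite); `psutil` 
is only used to read the process RSS, and the tensor shape is an arbitrary 
example of a few GB:

```python
# Hypothetical reproduction sketch: allocate a large NDArray, drop the
# reference, and check whether the process RSS goes back down.
import gc
import os

import psutil
import mxnet as mx

proc = psutil.Process(os.getpid())

def rss_gb():
    return proc.memory_info().rss / 1024**3

print("baseline RSS:            %.1f GB" % rss_gb())

# (40000, 40000) float32 is ~6.4 GB; pick any size the machine can hold.
x = mx.nd.zeros((40000, 40000), dtype='float32')
x.wait_to_read()
print("after allocation RSS:    %.1f GB" % rss_gb())

del x
gc.collect()
mx.nd.waitall()
# If the CPU storage manager keeps the buffer in its pool, RSS stays high
# here even though the NDArray itself is gone.
print("after going out of scope: %.1f GB" % rss_gb())
```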
Perhaps memory is not being released back to the pool after tensors go out of 
scope, or there could be an internal fragmentation issue in the pool itself. 
These are my observations from the test runs and from past experience reading 
`pooled_storage_manager`. I will dive deeper into it and try to come up with a 
concrete suggestion.
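
To illustrate the fragmentation hypothesis, below is a toy model of a 
size-keyed free list. It is a deliberately simplified stand-in, not MXNet's 
actual `pooled_storage_manager` logic: if freed buffers are only reusable for 
requests of the exact same (or same rounded) size, a suite that allocates many 
huge tensors of slightly different sizes can make the cumulative bytes 
requested from the OS far exceed what any single test needs.

```python
# Toy pool model (not MXNet code): freed buffers are kept in per-size free
# lists and only reused for requests of exactly the same size.
from collections import defaultdict

class ToySizeKeyedPool:
    def __init__(self):
        self.free_lists = defaultdict(list)  # size -> cached buffers
        self.bytes_from_os = 0               # total "fresh" memory requested

    def alloc(self, size):
        if self.free_lists[size]:
            return self.free_lists[size].pop()
        self.bytes_from_os += size           # no exact-size match -> grow pool
        return object()                      # placeholder for a real buffer

    def free(self, size, buf):
        self.free_lists[size].append(buf)    # returned to pool, never to the OS

pool = ToySizeKeyedPool()
GB = 1024**3

# Each "test" allocates and frees one ~40 GB tensor, but every test uses a
# slightly different size, so no cached buffer is ever reused.
for i in range(200):
    size = 40 * GB + i * 4096
    buf = pool.alloc(size)
    pool.free(size, buf)

print("peak live usage per test : ~40 GB")
print("bytes requested from OS  : ~%.0f GB" % (pool.bytes_from_os / GB))  # ~8000 GB
```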

## Proposed solutions
1. Make `MXNET_GPU_MEM_POOL_TYPE=Unpooled` also apply to the CPU, i.e. use a 
similar unpooled strategy for CPU memory (see the usage sketch after this list)
2. Fix the fragmentation issue within the pool, if any exists
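
As a usage sketch for option 1: `MXNET_GPU_MEM_POOL_TYPE=Unpooled` already 
exists on the GPU side; the CPU-side variable shown below 
(`MXNET_CPU_MEM_POOL_TYPE`) does not exist yet and is only an illustration of 
what the proposed knob could look like:

```python
import os

# Existing knob: disable the GPU memory pool entirely.
os.environ['MXNET_GPU_MEM_POOL_TYPE'] = 'Unpooled'

# Hypothetical knob proposed here: the same unpooled strategy for CPU
# allocations. (This variable does not exist yet; the name is illustrative.)
os.environ['MXNET_CPU_MEM_POOL_TYPE'] = 'Unpooled'

import mxnet as mx  # the env vars must be set before the first import

x = mx.nd.ones((1024, 1024))
x.wait_to_read()
```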
