[GitHub] [incubator-mxnet] lorenzob commented on issue #17126: Memory should be completely released after an OOM happens

GitBox Mon, 30 Dec 2019 03:48:34 -0800

lorenzob commented on issue #17126: Memory should be completely released after 
an OOM happens
URL: 
https://github.com/apache/incubator-mxnet/issues/17126#issuecomment-569656522
 
 
   @leezu @ptrendx I tested with commit d000c3, which I think includes #17114.
   
   I no longer get the "Operator is non-differentiable" error when 
MXNET_USE_FUSION is left unset. Setting MXNET_USE_FUSION=0 never solved the OOM 
error, though.
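
   For reference, a minimal sketch of setting the variable from Python instead of the shell. It assumes the variable should be in the environment before mxnet is imported, which is the safe ordering; the import is left commented out so the snippet runs without MXNet installed, and `fusion_setting` is just an illustrative name.

```python
import os

# Assumption: set the variable before importing mxnet so the library sees it.
os.environ["MXNET_USE_FUSION"] = "0"

# import mxnet as mx  # import only after the environment is prepared

fusion_setting = os.environ.get("MXNET_USE_FUSION")
print("MXNET_USE_FUSION =", fusion_setting)
```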
   
   I still get the OOM when a single inference batch contains more than ten 
112x112 images. The first OOM happens when I really do fill all the available 
free memory (about 3GB), but memory usage then stays near 100% after that first 
OOM exception (I added a time.sleep and checked this with nvidia-smi).
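
   The check described above could be scripted roughly as follows. This is only a sketch: `parse_memory_csv`, `query_gpu_memory` and `report_after_failure` are hypothetical helper names, the failing call is passed in as a plain callable, and the code assumes nvidia-smi is on the PATH and reports numeric values for every GPU.

```python
import subprocess
import time


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one line per GPU, into a list of int tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB) of every visible GPU."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_memory_csv(out)


def report_after_failure(run_inference, settle_seconds=5):
    """Run a callable; if it raises, wait briefly and return GPU memory usage."""
    try:
        run_inference()
    except Exception as exc:  # with MXNet this would be mxnet.base.MXNetError
        print("inference failed:", exc)
        time.sleep(settle_seconds)  # give the allocator time to release memory
        return query_gpu_memory()
    return None
```

   Calling `report_after_failure` with the inference call wrapped in a lambda returns the used and total memory per GPU if the call fails, which makes it easy to compare usage after the first and second OOM.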
   
   The first OOM and the second one are different:
   
   First:
   ```
     File "core/facerec.py", line 184, in compare
       embedding = model.model.get_outputs()[0].asnumpy()
     File "/home/trz/github/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2552, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/home/trz/github/incubator-mxnet/python/mxnet/base.py", line 278, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [12:42:39] ../src/storage/./pooled_storage_manager.h:161: cudaMalloc retry failed: out of memory
   ```
   
   Second:
   ```
     File "/home/trz/progetti/zzzz/git_repo/ai-face-matching/arcface/helper.py", line 156, in detect_first_stage
       output = net.predict(input_buf)
     File "/home/trz/github/incubator-mxnet/python/mxnet/model.py", line 750, in predict
       o_list.append(o_nd[0:real_size].asnumpy())
     File "/home/trz/github/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2552, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/home/trz/github/incubator-mxnet/python/mxnet/base.py", line 278, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [12:39:32] /home/trz/github/incubator-mxnet/include/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
