lorenzob commented on issue #17126: Memory should be completely released after an OOM happens

URL: https://github.com/apache/incubator-mxnet/issues/17126#issuecomment-569656522

@leezu @ptrendx I used commit d000c3, which I think includes #17114. I no longer get the "Operator is non-differentiable" error if I do not set MXNET_USE_FUSION. Setting MXNET_USE_FUSION=0 never fixed the OOM error.

I still get the OOM if I use more than 10 112x112 images in one batch (for inference). The first OOM actually fills all of the available free memory (about 3 GB), and memory usage stays near 100% after the first OOM exception (I added a time.sleep and checked this with nvidia-smi).

The first OOM and the second one are different.

First:

```
  File "core/facerec.py", line 184, in compare
    embedding = model.model.get_outputs()[0].asnumpy()
  File "/home/trz/github/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2552, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/trz/github/incubator-mxnet/python/mxnet/base.py", line 278, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [12:42:39] ../src/storage/./pooled_storage_manager.h:161: cudaMalloc retry failed: out of memory
```

Second:

```
  File "/home/trz/progetti/zzzz/git_repo/ai-face-matching/arcface/helper.py", line 156, in detect_first_stage
    output = net.predict(input_buf)
  File "/home/trz/github/incubator-mxnet/python/mxnet/model.py", line 750, in predict
    o_list.append(o_nd[0:real_size].asnumpy())
  File "/home/trz/github/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2552, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/trz/github/incubator-mxnet/python/mxnet/base.py", line 278, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [12:39:32] /home/trz/github/incubator-mxnet/include/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory
```
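For anyone wanting to reproduce the "memory stays near 100% after the OOM" observation, a minimal stdlib-only sketch of the nvidia-smi check described above follows. The helper names (`parse_used_mib`, `gpu_memory_used`) are hypothetical, not part of MXNet; the idea is simply to poll `nvidia-smi` after catching the first `MXNetError` and confirm that used memory has not dropped.

```python
import re
import subprocess


def parse_used_mib(smi_output):
    """Extract per-GPU used-memory figures (in MiB) from nvidia-smi text
    such as '|  2991MiB /  3072MiB |'. Returns a list of ints, one per GPU."""
    return [int(m) for m in re.findall(r"(\d+)MiB\s*/\s*\d+MiB", smi_output)]


def gpu_memory_used():
    """Run nvidia-smi and report the used memory per GPU (hypothetical helper;
    requires an NVIDIA driver to be installed)."""
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    return parse_used_mib(out)
```

One could call `gpu_memory_used()` inside the `except MXNetError` branch (after the `time.sleep` mentioned above) and compare the value before and after the failed batch.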
