yifeim opened a new issue #13470: revive mxnet from malloc errors plus an interesting usage pitfall
URL: https://github.com/apache/incubator-mxnet/issues/13470

## Description
MXNet often appears dead after a malloc error, whether the error comes from a user's inaccurate memory estimate or from [normal usage](https://github.com/apache/incubator-mxnet/issues/10453). However, this is not entirely the case: we observed a pattern that can "revive" MXNet after a malloc error. It is worth investigating this pattern, since it could let users catch malloc errors with try/except and retry with different settings.

PS: The minimal example includes an interesting usage pitfall.

## Environment info (Required)

```
----------Python Info----------
Version : 3.6.5
Compiler : GCC 7.2.0
Build : ('default', 'Apr 29 2018 16:14:56')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 10.0.1
Directory : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.4.0
Directory : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash : 0eacdb327abb8144c4d8da49c4d80765bf7b7f96
----------System Info----------
Platform : Linux-4.14.77-70.59.amzn1.x86_64-x86_64-with-glibc2.9
system : Linux
node : ip-172-16-95-144
release : 4.14.77-70.59.amzn1.x86_64
version : #1 SMP Mon Nov 12 22:02:45 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 0.3728 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1171 sec, LOAD: 0.1566 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.6782 sec, LOAD: 0.1020 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0085 sec, LOAD: 0.1203 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0151 sec, LOAD: 0.3625 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0101 sec, LOAD: 0.0954 sec.
```

Package used (Python/R/Scala/Julia): Python

## Error Message:
First message:
```
MXNetError: [23:32:52] src/storage/./pooled_storage_manager.h:143: cudaMalloc failed: out of memory
```

Subsequent message, where MXNet appears dead even though the operation itself is perfectly fine to carry out:
```
MXNetError: [23:32:52] src/operator/contrib/./.././../common/../operator/mxnet_op.h:680: Check failed: err == cudaSuccess (2 vs. 0) Name: mxnet_generic_kernel ErrStr:out of memory
```

## Minimum reproducible example
In an IPython environment (e.g., Jupyter Notebook):
```ipython
import mxnet as mx
from mxnet import gluon

loss = gluon.loss.SoftmaxCELoss()

# a usage pitfall that leads to malloc errors (can you find the actual mistake?)
loss(
    mx.nd.ones((100000,1), ctx=mx.gpu()),
    mx.nd.ones((100000,), ctx=mx.gpu()),
    mx.nd.ones((100000,), ctx=mx.gpu()),
)
# MXNetError: [23:32:52] src/storage/./pooled_storage_manager.h:143: cudaMalloc failed: out of memory

# an okay command that cannot run because mxnet is dead
loss(
    mx.nd.ones((100000,1), ctx=mx.gpu()),
    mx.nd.ones((100000,), ctx=mx.gpu()),
    mx.nd.ones((100000,1), ctx=mx.gpu()),
)
# MXNetError: [23:32:52] src/operator/contrib/./.././../common/../operator/mxnet_op.h:680: Check failed: err == cudaSuccess (2 vs. 0) Name: mxnet_generic_kernel ErrStr:out of memory

# the magic command to revive mxnet in ipython environments
!ls

# now the okay command can run
loss(
    mx.nd.ones((100000,1), ctx=mx.gpu()),
    mx.nd.ones((100000,), ctx=mx.gpu()),
    mx.nd.ones((100000,1), ctx=mx.gpu()),
)
# [ 0. 0. 0. ..., 0. 0. 0.]
# <NDArray 100000 @gpu(0)>
```

## Steps to reproduce

1. Use Jupyter Notebook or IPython.
2. Run the commands above in order.
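
A likely explanation of the pitfall in the first call above, offered as a sketch and not as a confirmed diagnosis: the third positional argument of the loss is the sample weight, and a weight of shape `(100000,)` multiplied against a per-sample loss of shape `(100000, 1)` would broadcast to `(100000, 100000)`. The snippet below only does the shape arithmetic in plain Python, with no MXNet or GPU required. It assumes NumPy-style broadcasting, a float32 dtype, and a per-sample loss of shape `(100000, 1)` before weighting; `broadcast_shape` and `float32_gib` are hypothetical helper names, not MXNet APIs.

```python
from itertools import zip_longest
from math import prod  # requires Python 3.8+


def broadcast_shape(a, b):
    """Return the NumPy-style broadcast shape of two shapes, or raise ValueError."""
    out = []
    # Align trailing dimensions; missing leading dimensions count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        # A dimension of size 1 stretches to match the other operand.
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def float32_gib(shape):
    """Size in GiB of a float32 array with the given shape."""
    return prod(shape) * 4 / 2**30


per_sample_loss = (100000, 1)  # assumed shape of the loss before weighting
bad_weight = (100000,)         # third argument in the failing call
good_weight = (100000, 1)      # third argument in the working call

bad = broadcast_shape(per_sample_loss, bad_weight)
good = broadcast_shape(per_sample_loss, good_weight)

print("failing call:", bad, f"{float32_gib(bad):.1f} GiB")
print("working call:", good, f"{float32_gib(good):.6f} GiB")
```

Under these assumptions the failing call asks for a 100000 x 100000 float32 intermediate, far more than a single GPU of that era could hold, while the working call stays at 100000 elements. Passing the weight with an explicit trailing dimension of 1, as the working call does, avoids the blow-up.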
