yifeim opened a new issue #13470: revive mxnet from malloc errors plus an 
interesting usage pitfall
URL: https://github.com/apache/incubator-mxnet/issues/13470
 
 
   ## Description
   MXNet often appears dead after a malloc error, whether the error comes from the user's inaccurate memory estimate or from [normal usage](https://github.com/apache/incubator-mxnet/issues/10453). However, this is not entirely the case: we observed a pattern that can "revive" MXNet after a malloc error. The pattern is worth investigating, because it could allow people to try-catch malloc errors and retry with different settings.
   
   PS: The minimal example includes an interesting usage pitfall.
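
To make the try-catch-and-retry idea concrete, here is a minimal sketch of the usage it would enable. The helper `run_with_retry` and the stand-in `fake_step` are hypothetical names, not MXNet APIs, and the sketch uses `MemoryError` so it runs without a GPU. With MXNet one would presumably pass `mx.base.MXNetError` as `oom_errors` and force synchronization inside `step` (for example with `.asnumpy()`), since GPU errors surface asynchronously. As this issue reports, such a retry currently fails because MXNet stays dead after the first error, so this shows the desired behavior rather than something that works today.

```python
def run_with_retry(step, batch_size, oom_errors=(MemoryError,), min_batch_size=1):
    """Call step(batch_size); on an out-of-memory error, halve batch_size and retry.

    Returns (result, batch_size_that_worked). Hypothetical helper for illustration.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except oom_errors:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)


# Stand-in workload so the sketch runs without a GPU:
# it simulates an allocation failure above 1000 rows.
def fake_step(batch_size):
    if batch_size > 1000:
        raise MemoryError("simulated cudaMalloc failure")
    return batch_size * 2


result, used = run_with_retry(fake_step, batch_size=8000)
print(result, used)
```

Halving is only one possible "different setting"; the same loop could just as well fall back to another context or a lower precision.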
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.6.5
   Compiler     : GCC 7.2.0
   Build        : ('default', 'Apr 29 2018 16:14:56')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 10.0.1
   Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.4.0
   Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
   Commit Hash   : 0eacdb327abb8144c4d8da49c4d80765bf7b7f96
   ----------System Info----------
   Platform     : Linux-4.14.77-70.59.amzn1.x86_64-x86_64-with-glibc2.9
   system       : Linux
   node         : ip-172-16-95-144
   release      : 4.14.77-70.59.amzn1.x86_64
   version      : #1 SMP Mon Nov 12 22:02:45 UTC 2018
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 0.3728 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1171 sec, LOAD: 0.1566 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.6782 sec, LOAD: 0.1020 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0085 sec, LOAD: 0.1203 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0151 sec, LOAD: 0.3625 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0101 sec, LOAD: 0.0954 sec.
   ```
   
   Package used (Python/R/Scala/Julia): Python
   
   ## Error Message:
   First message:
   ```
   MXNetError: [23:32:52] src/storage/./pooled_storage_manager.h:143: cudaMalloc failed: out of memory
   ```
   Subsequent message, raised while MXNet is dead even though the operation itself is perfectly fine to carry out:
   ```
   MXNetError: [23:32:52] src/operator/contrib/./.././../common/../operator/mxnet_op.h:680: Check failed: err == cudaSuccess (2 vs. 0) Name: mxnet_generic_kernel ErrStr:out of memory
   ```
   
   ## Minimum reproducible example
   In an IPython environment (e.g., a Jupyter notebook):
   ```ipython
   import mxnet as mx
   from mxnet import gluon
   loss = gluon.loss.SoftmaxCELoss()
   
   # a usage pitfall that leads to malloc errors (can you find the actual mistake?)
   loss(
       mx.nd.ones((100000,1), ctx=mx.gpu()),
       mx.nd.ones((100000,), ctx=mx.gpu()),
       mx.nd.ones((100000,), ctx=mx.gpu()),
   )
   # MXNetError: [23:32:52] src/storage/./pooled_storage_manager.h:143: cudaMalloc failed: out of memory
   
   # an okay command that cannot run because mxnet is dead
   loss(
       mx.nd.ones((100000,1), ctx=mx.gpu()),
       mx.nd.ones((100000,), ctx=mx.gpu()),
       mx.nd.ones((100000,1), ctx=mx.gpu()),
   )
   # MXNetError: [23:32:52] src/operator/contrib/./.././../common/../operator/mxnet_op.h:680: Check failed: err == cudaSuccess (2 vs. 0) Name: mxnet_generic_kernel ErrStr:out of memory
   
   # the magic command to revive mxnet in ipython environments
   !ls
   
   # now the okay command can run
   loss(
       mx.nd.ones((100000,1), ctx=mx.gpu()),
       mx.nd.ones((100000,), ctx=mx.gpu()),
       mx.nd.ones((100000,1), ctx=mx.gpu()),
   )
   # [ 0.  0.  0. ...,  0.  0.  0.]
   # <NDArray 100000 @gpu(0)>
   ```
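
One plausible reading of the pitfall, offered as an assumption rather than something this report confirms: if the loss keeps a per-sample value of shape `(N, 1)` and multiplies it by the `sample_weight` of shape `(N,)` under NumPy-style broadcasting, the product has shape `(N, N)`. The pure-Python sketch below only does the shape arithmetic; `broadcast_shape` is a hypothetical helper, not an MXNet function.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the NumPy-style broadcast shape of two shapes, aligned from the right."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {a} with {b}")
        # A dimension of size 1 stretches to match the other operand.
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


n = 100000
bad = broadcast_shape((n, 1), (n,))     # assumed per-sample loss (n, 1) times weight (n,)
good = broadcast_shape((n, 1), (n, 1))  # weight given as (n, 1), as in the working call

# Memory for the intermediate product if it were materialized as float32.
bytes_per_float32 = 4
bad_gib = bad[0] * bad[1] * bytes_per_float32 / 2**30
print(bad, f"{bad_gib:.1f} GiB")
print(good)
```

Under that assumption the first call asks for tens of GiB for a single intermediate array, which would explain the `cudaMalloc` failure, while the `(N, 1)` weight in the later calls keeps the product at `(N, 1)`.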
   
   ## Steps to reproduce
   
   1. Use a Jupyter notebook or IPython.
   2. Run the commands above in order.
   
