chinakook opened a new issue #19577:
URL: https://github.com/apache/incubator-mxnet/issues/19577
## Description
Multi-GPU training error on MXNet 2.0 master (20201123)
### Error Message
```
MXNET_ENABLE_GPU_P2P=0 python example/gluon/image_classification.py
--dataset dummy -j 12 --batch-size 16 --gpus 0,1,2,3 --model resnet101_v1
INFO:root:Starting new image-classification task:,
Namespace(batch_norm=False, batch_size=16, builtin_profiler=0, data_dir='',
dataset='dummy', dtype='float32', epochs=120, gpus='0,1,2,3', kvstore='device',
log_interval=50, lr=0.1, lr_factor=0.1, lr_steps='30,60,90', mode=None,
model='resnet101_v1', momentum=0.9, num_workers=12, prefix='', profile=False,
resume='', save_frequency=10, seed=123, start_epoch=0, use_pretrained=False,
use_thumbnail=False, wd=0.0001)
INFO:root:NumPy-shape semantics has been activated in your code. This is
required for creating and manipulating scalar and zero-size tensors, which were
not supported in MXNet before, as in the official NumPy library. Please DO NOT
manually deactivate this semantics while using `mxnet.numpy` and
`mxnet.numpy_extension` modules.
[13:23:18] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for CPU
[13:23:20] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:22] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:24] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:26] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:27] ../src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running
performance tests to find the best convolution algorithm, this can take a
while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to
disable)
[13:23:27] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for CPU_PINNED
WARNING:root:np.argpartition is a fallback operator, which is actually using
official numpy's implementation.
Traceback (most recent call last):
File "example/gluon/image_classification.py", line 278, in <module>
main()
File "example/gluon/image_classification.py", line 262, in main
train(opt, context)
File "example/gluon/image_classification.py", line 230, in train
metric.update(label, outputs)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/gluon/metric.py",
line 324, in update
metric.update(labels, preds)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/util.py",
line 299, in _with_np_shape
return func(*args, **kwargs)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/util.py",
line 480, in _with_np_array
return func(*args, **kwargs)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/gluon/metric.py",
line 507, in update
pred_label = numpy.argpartition(pred_label, -self.top_k)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/fallback.py",
line 122, in wrapper
return obj(*args, **kwargs)
File "<__array_function__ internals>", line 5, in argpartition
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/multiarray.py",
line 380, in __array_function__
new_args, cur_ctx = _as_onp_array(args)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/multiarray.py",
line 209, in _as_onp_array
arr, tmp_ctx = _as_onp_array(arr)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/multiarray.py",
line 205, in _as_onp_array
return object.asnumpy(), object.ctx
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/ndarray/ndarray.py",
line 2600, in asnumpy
check_call(_LIB.MXNDArraySyncCopyToCPU(
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/base.py",
line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../include/mshadow/./stream_gpu-inl.h", line 91
CUDA: Check failed: e == cudaSuccess (700 vs. 0) : an illegal memory access
was encountered
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/base.py",
line 529, in _notify_shutdown
check_call(_LIB.MXNotifyShutdown())
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/base.py",
line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../include/mshadow/./stream_gpu-inl.h", line 91
CUDA: Check failed: e == cudaSuccess (700 vs. 0) : an illegal memory access
was encountered
```
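The traceback ends in `TopKAccuracy.update` (gluon/metric.py, line 507), where `numpy.argpartition(pred_label, -self.top_k)` triggers the fallback warned about earlier in the log: `mxnet.numpy` has no native `argpartition`, so the array is first copied to the CPU via `asnumpy()`. That device-to-host copy is a synchronization point, which is likely where an asynchronous `cudaErrorIllegalAddress` (700) raised by an earlier kernel finally gets reported; the metric code itself may not be the real culprit. For reference, a minimal plain-NumPy sketch of what that fallback call computes (shapes and values are made up for illustration, not taken from the run above):

```python
import numpy as np

# Fake logits: a batch of 4 samples over 6 classes.
pred = np.array([
    [0.1, 0.3, 0.2, 0.9, 0.05, 0.4],
    [0.7, 0.1, 0.6, 0.2, 0.8,  0.3],
    [0.2, 0.2, 0.5, 0.1, 0.3,  0.6],
    [0.9, 0.8, 0.1, 0.3, 0.2,  0.4],
])
top_k = 2

# Same call as in gluon/metric.py: partition each row so that the
# indices of its top_k largest entries land in the last top_k columns
# (in arbitrary order within that block). metric.py omits the axis
# argument; the default is axis=-1, shown explicitly here.
part = np.argpartition(pred, -top_k, axis=-1)
top_k_idx = part[:, -top_k:]
print(sorted(top_k_idx[0].tolist()))  # → [3, 5]
```

On an `mxnet.numpy` GPU array, the same call routes through `fallback.py` and forces the `asnumpy()` copy seen in the traceback, which is why the CUDA failure surfaces here rather than at the kernel that caused it.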
## To Reproduce
### Steps to reproduce
1. Build MXNet 2.0 master (20201123) with CUDA support.
2. Run the bundled Gluon example across four GPUs:
   `MXNET_ENABLE_GPU_P2P=0 python example/gluon/image_classification.py --dataset dummy -j 12 --batch-size 16 --gpus 0,1,2,3 --model resnet101_v1`
## What have you tried to solve it?
1. Ran with `MXNET_ENABLE_GPU_P2P=0` to disable peer-to-peer GPU copies; the error still occurs (see the log above).
## Environment
***We recommend using our script for collecting the diagnostic information
with the following command***
`curl --retry 10 -s
https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
| python3`
<details>
<summary>Environment Information</summary>
```
# Paste the diagnose.py command output here
```
</details>