chinakook opened a new issue #19577:
URL: https://github.com/apache/incubator-mxnet/issues/19577
## Description
Multi-GPU training error on MXNet 2.0 master (20201123)
### Error Message
```
MXNET_ENABLE_GPU_P2P=0 python example/gluon/image_classification.py
--dataset dummy -j 12 --batch-size 16 --gpus 0,1,2,3 --model resnet101_v1
INFO:root:Starting new image-classification task:,
Namespace(batch_norm=False, batch_size=16, builtin_profiler=0, data_dir='',
dataset='dummy', dtype='float32', epochs=120, gpus='0,1,2,3', kvstore='device',
log_interval=50, lr=0.1, lr_factor=0.1, lr_steps='30,60,90', mode=None,
model='resnet101_v1', momentum=0.9, num_workers=12, prefix='', profile=False,
resume='', save_frequency=10, seed=123, start_epoch=0, use_pretrained=False,
use_thumbnail=False, wd=0.0001)
INFO:root:NumPy-shape semantics has been activated in your code. This is
required for creating and manipulating scalar and zero-size tensors, which were
not supported in MXNet before, as in the official NumPy library. Please DO NOT
manually deactivate this semantics while using `mxnet.numpy` and
`mxnet.numpy_extension` modules.
[13:23:18] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for CPU
[13:23:20] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:22] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:24] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:26] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for GPU
[13:23:27] ../src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running
performance tests to find the best convolution algorithm, this can take a
while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to
disable)
[13:23:27] ../src/storage/storage.cc:199: Using Pooled (Naive)
StorageManager for CPU_PINNED
WARNING:root:np.argpartition is a fallback operator, which is actually using
official numpy's implementation.
Traceback (most recent call last):
File "example/gluon/image_classification.py", line 278, in <module>
main()
File "example/gluon/image_classification.py", line 262, in main
train(opt, context)
File "example/gluon/image_classification.py", line 230, in train
metric.update(label, outputs)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/gluon/metric.py",
line 324, in update
metric.update(labels, preds)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/util.py",
line 299, in _with_np_shape
return func(*args, **kwargs)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/util.py",
line 480, in _with_np_array
return func(*args, **kwargs)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/gluon/metric.py",
line 507, in update
pred_label = numpy.argpartition(pred_label, -self.top_k)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/fallback.py",
line 122, in wrapper
return obj(*args, **kwargs)
File "<__array_function__ internals>", line 5, in argpartition
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/multiarray.py",
line 380, in __array_function__
new_args, cur_ctx = _as_onp_array(args)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/multiarray.py",
line 209, in _as_onp_array
arr, tmp_ctx = _as_onp_array(arr)
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/numpy/multiarray.py",
line 205, in _as_onp_array
return object.asnumpy(), object.ctx
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/ndarray/ndarray.py",
line 2600, in asnumpy
check_call(_LIB.MXNDArraySyncCopyToCPU(
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/base.py",
line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../include/mshadow/./stream_gpu-inl.h", line 91
CUDA: Check failed: e == cudaSuccess (700 vs. 0) : an illegal memory access
was encountered
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/base.py",
line 529, in _notify_shutdown
check_call(_LIB.MXNotifyShutdown())
File
"/home/bluews/anaconda3/envs/mymx/lib/python3.8/site-packages/mxnet/base.py",
line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../include/mshadow/./stream_gpu-inl.h", line 91
CUDA: Check failed: e == cudaSuccess (700 vs. 0) : an illegal memory access
was encountered
```
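The traceback ends in `TopKAccuracy.update` (gluon/metric.py, line 507), where `numpy.argpartition(pred_label, -self.top_k)` triggers the fallback warned about earlier in the log: `mxnet.numpy` has no native `argpartition`, so the array is first copied to the CPU via `asnumpy()`. That device-to-host copy is a synchronization point, which is likely where an asynchronous `cudaErrorIllegalAddress` (700) raised by an earlier kernel finally gets reported; the metric code itself may not be the real culprit. For reference, a minimal plain-NumPy sketch of what that fallback call computes (shapes and values are made up for illustration, not taken from the run above):

```python
import numpy as np

# Fake logits: a batch of 4 samples over 6 classes.
pred = np.array([
    [0.1, 0.3, 0.2, 0.9, 0.05, 0.4],
    [0.7, 0.1, 0.6, 0.2, 0.8,  0.3],
    [0.2, 0.2, 0.5, 0.1, 0.3,  0.6],
    [0.9, 0.8, 0.1, 0.3, 0.2,  0.4],
])
top_k = 2

# Same call as in gluon/metric.py: partition each row so that the
# indices of its top_k largest entries land in the last top_k columns
# (in arbitrary order within that block). metric.py omits the axis
# argument; the default is axis=-1, shown explicitly here.
part = np.argpartition(pred, -top_k, axis=-1)
top_k_idx = part[:, -top_k:]
print(sorted(top_k_idx[0].tolist()))  # → [3, 5]
```

On an `mxnet.numpy` GPU array, the same call routes through `fallback.py` and forces the `asnumpy()` copy seen in the traceback, which is why the CUDA failure surfaces here rather than at the kernel that caused it.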
## To Reproduce
### Steps to reproduce
1. Build MXNet 2.0 master (20201123) with CUDA support.
2. Run the bundled Gluon example across four GPUs:
   `MXNET_ENABLE_GPU_P2P=0 python example/gluon/image_classification.py --dataset dummy -j 12 --batch-size 16 --gpus 0,1,2,3 --model resnet101_v1`
## What have you tried to solve it?
1. Ran with `MXNET_ENABLE_GPU_P2P=0` to disable peer-to-peer GPU copies; the error still occurs (see the log above).
## Environment
***We recommend using our script for collecting the diagnostic information
with the following command***
`curl --retry 10 -s
https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
| python3`
<details>
<summary>Environment Information</summary>
```
# Paste the diagnose.py command output here
```
</details>