ianferreira opened a new issue #14034: err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure when training
URL: https://github.com/apache/incubator-mxnet/issues/14034

## Description
err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure when training Alexnet

## Environment info (Required)
----------Python Info----------
Version : 3.6.7
Compiler : MSC v.1900 64 bit (AMD64)
Build : ('v3.6.7:6ec5cf24b7', 'Oct 20 2018 13:35:33')
Arch : ('64bit', 'WindowsPE')
------------Pip Info-----------
Version : 18.1
Directory : C:\Users\ianfe\envs\mxnet\lib\site-packages\pip
----------MXNet Info-----------
Version : 1.3.1
Directory : C:\Users\ianfe\envs\mxnet\lib\site-packages\mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Windows-10-10.0.17763-SP0
system : Windows
node : DESKTOP-RNUS3LP
release : 10
version : 10.0.17763
----------Hardware Info----------
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
Name AMD Ryzen Threadripper 1950X 16-Core Processor
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0170 sec, LOAD: 1.6175 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0590 sec, LOAD: 0.2110 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0440 sec, LOAD: 0.1160 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0170 sec, LOAD: 0.0920 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0140 sec, LOAD: 0.3180 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0150 sec, LOAD: 0.0430 sec.
Package used (Python/R/Scala/Julia):

## Error Message:
Traceback (most recent call last):
  File "train_alexnet.py", line 111, in <module>
    epoch_end_callback=epochEndCBs)
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\model.py", line 893, in fit
    sym_gen=self.sym_gen)
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\model.py", line 325, in _train_multi_device
    executor_manager.update_metric(eval_metric, data_batch.label)
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\executor_manager.py", line 444, in update_metric
    self.curr_execgrp.update_metric(metric, labels, pre_sliced)
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\executor_manager.py", line 296, in update_metric
    metric.update(labels_slice, texec.outputs)
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\metric.py", line 318, in update
    metric.update(labels, preds)
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\metric.py", line 418, in update
    pred_label = pred_label.asnumpy().astype('int32')
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\ndarray\ndarray.py", line 1972, in asnumpy
    ctypes.c_size_t(data.size)))
  File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:41:47] c:\jenkins\workspace\mxnet-tag\mxnet\3rdparty\mshadow\mshadow\./cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure

## Minimum reproducible example
Intermittent failure, but always the same error.
## Steps to reproduce
model = mx.model.FeedForward(
    ctx=[mx.gpu(0), mx.gpu(1), mx.gpu(2)],
    symbol=model,
    initializer=mx.initializer.Xavier(),
    arg_params=argParams,
    aux_params=auxParams,
    optimizer=opt,
    num_epoch=90,
    begin_epoch=args["start_epoch"])

print("[INFO] training network...")
model.fit(
    X=trainIter,
    eval_data=valIter,
    eval_metric=metrics,
    batch_end_callback=batchEndCBs,
    epoch_end_callback=epochEndCBs)

## What have you tried to solve it?
1. Rebooted, restarted script
2. Seems to run for 70 epochs in some cases, other times fails after 5. No consistent repro.
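
A possible next debugging step, offered as a sketch rather than a confirmed fix: the traceback ends in `asnumpy()`, which is where MXNet waits for queued GPU work to finish, so the kernel that actually failed may have been launched earlier than the reported line. Setting `MXNET_ENGINE_TYPE=NaiveEngine` and `CUDA_LAUNCH_BLOCKING=1` before MXNet is imported should make execution synchronous, so the error is raised closer to the operator that caused it. The helper name `enable_sync_debugging` is hypothetical and not part of MXNet, and training is expected to run noticeably slower with these settings.

```python
import os

# Debug-only settings. Assumption: they are applied before `import mxnet`,
# so the MXNet engine and the CUDA runtime read them at start-up.
SYNC_DEBUG_ENV = {
    "MXNET_ENGINE_TYPE": "NaiveEngine",  # MXNet's synchronous execution engine
    "CUDA_LAUNCH_BLOCKING": "1",         # make CUDA kernel launches blocking
}


def enable_sync_debugging(env=os.environ):
    """Hypothetical helper: apply the debug settings to `env`.

    Returns the previous values (None where a variable was unset) so the
    caller can restore them after the debugging run.
    """
    previous = {name: env.get(name) for name in SYNC_DEBUG_ENV}
    env.update(SYNC_DEBUG_ENV)
    return previous
```

In `train_alexnet.py` this would be called at the very top of the script, ahead of `import mxnet as mx`; once the failing operator has been identified, the returned values can be used to put the environment back.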
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
