ianferreira opened a new issue #14034: err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure when training
URL: https://github.com/apache/incubator-mxnet/issues/14034
 
 
   ## Description
   err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure when training Alexnet
   
   ## Environment info (Required)
   ----------Python Info----------
   Version      : 3.6.7
   Compiler     : MSC v.1900 64 bit (AMD64)
   Build        : ('v3.6.7:6ec5cf24b7', 'Oct 20 2018 13:35:33')
   Arch         : ('64bit', 'WindowsPE')
   ------------Pip Info-----------
   Version      : 18.1
   Directory    : C:\Users\ianfe\envs\mxnet\lib\site-packages\pip
   ----------MXNet Info-----------
   Version      : 1.3.1
   Directory    : C:\Users\ianfe\envs\mxnet\lib\site-packages\mxnet
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Windows-10-10.0.17763-SP0
   system       : Windows
   node         : DESKTOP-RNUS3LP
   release      : 10
   version      : 10.0.17763
   ----------Hardware Info----------
   machine      : AMD64
   processor    : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
   Name
   AMD Ryzen Threadripper 1950X 16-Core Processor
   
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0170 sec, LOAD: 1.6175 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0590 sec, LOAD: 0.2110 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0440 sec, LOAD: 0.1160 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0170 sec, LOAD: 0.0920 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0140 sec, LOAD: 0.3180 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0150 sec, LOAD: 0.0430 sec.
   Package used (Python/R/Scala/Julia):
   
   ## Error Message:
   Traceback (most recent call last):
     File "train_alexnet.py", line 111, in <module>
       epoch_end_callback=epochEndCBs)
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\model.py", line 893, in fit
       sym_gen=self.sym_gen)
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\model.py", line 325, in _train_multi_device
       executor_manager.update_metric(eval_metric, data_batch.label)
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\executor_manager.py", line 444, in update_metric
       self.curr_execgrp.update_metric(metric, labels, pre_sliced)
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\executor_manager.py", line 296, in update_metric
       metric.update(labels_slice, texec.outputs)
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\metric.py", line 318, in update
       metric.update(labels, preds)
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\metric.py", line 418, in update
       pred_label = pred_label.asnumpy().astype('int32')
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\ndarray\ndarray.py", line 1972, in asnumpy
       ctypes.c_size_t(data.size)))
     File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\base.py", line 251, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [17:41:47] c:\jenkins\workspace\mxnet-tag\mxnet\3rdparty\mshadow\mshadow\./cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure
   
   ## Minimum reproducible example
   Intermittent failure, but always the same error.
   
   ## Steps to reproduce
   
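   import mxnet as mx

   # Excerpt from train_alexnet.py. The names model (the AlexNet symbol),
   # argParams, auxParams, opt, args, trainIter, valIter, metrics,
   # batchEndCBs and epochEndCBs are defined earlier in the script (not shown).
   model = mx.model.FeedForward(
       ctx=[mx.gpu(0), mx.gpu(1), mx.gpu(2)],  # train across three GPUs
       symbol=model,
       initializer=mx.initializer.Xavier(),
       arg_params=argParams,
       aux_params=auxParams,
       optimizer=opt,
       num_epoch=90,
       begin_epoch=args["start_epoch"])

   print("[INFO] training network...")
   # The traceback above is raised from this call (line 111 of the script).
   model.fit(
       X=trainIter,
       eval_data=valIter,
       eval_metric=metrics,
       batch_end_callback=batchEndCBs,
       epoch_end_callback=epochEndCBs)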
   
   ## What have you tried to solve it?
   
   1. Rebooted, restarted script
   2. Training sometimes runs for 70 epochs before failing and other times fails after 5. No consistent repro.
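
One further step that has not been tried yet, sketched here as a possibility rather than a confirmed fix: the traceback ends in `asnumpy()`, which is only where the asynchronous GPU work gets waited on, so the kernel that actually failed may have been launched earlier. Switching to MXNet's synchronous `NaiveEngine` (the `MXNET_ENGINE_TYPE` environment variable) together with CUDA's `CUDA_LAUNCH_BLOCKING=1` might make the error surface closer to the operator that caused it. The helper name `enable_sync_debugging` is made up for illustration, and the assumption is that it runs before `import mxnet`.

```python
import os

# Debug settings. Assumption: they must be in the environment before
# `import mxnet` so the engine and CUDA runtime see them at start-up.
DEBUG_ENV = {
    "MXNET_ENGINE_TYPE": "NaiveEngine",  # MXNet: run operators synchronously
    "CUDA_LAUNCH_BLOCKING": "1",         # CUDA: make kernel launches blocking
}


def enable_sync_debugging(env=os.environ):
    """Set the debug variables in `env`; return the values they replaced."""
    # Remember what was there before so the caller can restore it later.
    previous = {name: env.get(name) for name in DEBUG_ENV}
    env.update(DEBUG_ENV)
    return previous
```

In train_alexnet.py this would be called at the very top, before `import mxnet as mx`. Training will likely be much slower in this mode, so it is only meant for narrowing down the failing operator.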
   
