Harold-Zhang opened a new issue #7958: Engine shutdown
URL: https://github.com/apache/incubator-mxnet/issues/7958

## Environment info

Operating System: Ubuntu 14.04
Compiler: gcc 4.8.4
Package used (Python/R/Scala/Julia): Python 2.7
MXNet version: The latest version
GPU: Tesla K40m

## Error Message:

[19:43:24] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
2017-09-19 19:43:45,834 - Epoch[0] Batch [20] Speed: 4.56 samples/sec accuracy=0.964286
2017-09-19 19:44:03,379 - Epoch[0] Batch [40] Speed: 4.56 samples/sec accuracy=1.000000
2017-09-19 19:44:20,942 - Epoch[0] Batch [60] Speed: 4.56 samples/sec accuracy=1.000000
2017-09-19 19:44:38,747 - Epoch[0] Batch [80] Speed: 4.49 samples/sec accuracy=1.000000
2017-09-19 19:44:56,319 - Epoch[0] Batch [100] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:45:13,862 - Epoch[0] Batch [120] Speed: 4.56 samples/sec accuracy=1.000000
2017-09-19 19:45:31,494 - Epoch[0] Batch [140] Speed: 4.54 samples/sec accuracy=1.000000
2017-09-19 19:45:49,110 - Epoch[0] Batch [160] Speed: 4.54 samples/sec accuracy=1.000000
2017-09-19 19:46:06,677 - Epoch[0] Batch [180] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:46:24,257 - Epoch[0] Batch [200] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:46:41,886 - Epoch[0] Batch [220] Speed: 4.54 samples/sec accuracy=1.000000
2017-09-19 19:46:59,501 - Epoch[0] Batch [240] Speed: 4.54 samples/sec accuracy=1.000000
2017-09-19 19:47:17,085 - Epoch[0] Batch [260] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:47:34,667 - Epoch[0] Batch [280] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:47:52,273 - Epoch[0] Batch [300] Speed: 4.54 samples/sec accuracy=1.000000
2017-09-19 19:48:09,861 - Epoch[0] Batch [320] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:48:27,503 - Epoch[0] Batch [340] Speed: 4.53 samples/sec accuracy=1.000000
2017-09-19 19:48:45,085 - Epoch[0] Batch [360] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:49:02,700 - Epoch[0] Batch [380] Speed: 4.54 samples/sec accuracy=1.000000
2017-09-19 19:49:20,358 - Epoch[0] Batch [400] Speed: 4.53 samples/sec accuracy=1.000000
2017-09-19 19:49:37,943 - Epoch[0] Batch [420] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:49:55,530 - Epoch[0] Batch [440] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:50:13,105 - Epoch[0] Batch [460] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:50:30,683 - Epoch[0] Batch [480] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:50:48,265 - Epoch[0] Batch [500] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:51:05,903 - Epoch[0] Batch [520] Speed: 4.54 samples/sec accuracy=1.000000
2017-09-19 19:51:23,492 - Epoch[0] Batch [540] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:51:41,176 - Epoch[0] Batch [560] Speed: 4.52 samples/sec accuracy=1.000000
2017-09-19 19:51:58,766 - Epoch[0] Batch [580] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:52:16,347 - Epoch[0] Batch [600] Speed: 4.55 samples/sec accuracy=1.000000
2017-09-19 19:52:33,933 - Epoch[0] Batch [620] Speed: 4.55 samples/sec accuracy=1.000000
[19:52:38] /home/harold/mxnet/dmlc-core/include/dmlc/logging.h:308: [19:52:38] src/io/image_io.cc:165: Check failed: !dst.empty()

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f9450196c8c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io12ImdecodeImplEibPvmPNS_7NDArrayE+0x67a) [0x7f9451a3eefa]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataS1_S3_+0x23) [0x7f9450284963]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKc+0x8b) [0x7f94519daf4b]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6Engine8PushSyncESt8functionIFvNS_10RunContextEEENS_7ContextERKSt6vectorIPNS_6engine3VarESaIS9_EESD_NS_10FnPropertyEiPKc+0x124) [0x7f9450285814]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io8ImdecodeERKN4nnvm9NodeAttrsERKSt6vectorINS_7NDArrayESaIS6_EEPS8_+0xc90) [0x7f9451a40a10]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z20ImperativeInvokeImplRKN5mxnet7ContextEON4nnvm9NodeAttrsEPSt6vectorINS_7NDArrayESaIS7_EESA_PS6_IbSaIbEESD_+0x3cf) [0x7f94519a91ff]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z22MXImperativeInvokeImplPviPS_PiPS0_iPPKcS5_+0x25b) [0x7f94519bb43b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(MXImperativeInvokeEx+0x2f) [0x7f94519a982f]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f9468629adc]

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    image_shape='3,224,224', epoch=0, num_epoch=args.num_epoch, kv=kv)
  File "train.py", line 106, in train_model
    epoch_end_callback=checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/module/base_module.py", line 491, in fit
    next_data_batch = next(data_iter)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 1151, in next
    data = self.imdecode(s)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 1183, in imdecode
    return imdecode(s)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 136, in imdecode
    return _internal._cvimdecode(buf, *args, **kwargs)
  File "<string>", line 16, in _cvimdecode
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/base.py", line 143, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:52:38] src/io/image_io.cc:165: Check failed: !dst.empty()

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f9450196c8c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io12ImdecodeImplEibPvmPNS_7NDArrayE+0x67a) [0x7f9451a3eefa]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataS1_S3_+0x23) [0x7f9450284963]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKc+0x8b) [0x7f94519daf4b]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6Engine8PushSyncESt8functionIFvNS_10RunContextEEENS_7ContextERKSt6vectorIPNS_6engine3VarESaIS9_EESD_NS_10FnPropertyEiPKc+0x124) [0x7f9450285814]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io8ImdecodeERKN4nnvm9NodeAttrsERKSt6vectorINS_7NDArrayESaIS6_EEPS8_+0xc90) [0x7f9451a40a10]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z20ImperativeInvokeImplRKN5mxnet7ContextEON4nnvm9NodeAttrsEPSt6vectorINS_7NDArrayESaIS7_EESA_PS6_IbSaIbEESD_+0x3cf) [0x7f94519a91ff]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z22MXImperativeInvokeImplPviPS_PiPS0_iPPKcS5_+0x25b) [0x7f94519bb43b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(MXImperativeInvokeEx+0x2f) [0x7f94519a982f]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f9468629adc]

[19:52:38] src/engine/naive_engine.cc:53: Engine shutdown

## Please provide the commands you have run that lead to the error.

I used the pretrained model from https://github.com/cypw/DPNs

commands:
python train.py --epoch 0 --model ./models/dpn92-extra --batch-size 4 --num-classes 2 --data-train ./lst_train.lst --image-train ./data/ --data-val ./lst_val.lst --image-val ./data/ --num-examples 2000 --lr 0.001 --gpus 0 --num-epoch 20 --save-result ./output

I have also tried --batch-size 16 and 32 and got the same result.

## What have you tried to solve it?

At first, I got this error: "An fatal error occurred in asynchronous engine operation." Following a guide, I set the environment variables MXNET_CUDNN_AUTOTUNE_DEFAULT=0 and MXNET_ENGINE_TYPE=NaiveEngine, and then got the output above.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards, Apache Git Services
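
The failing check in the trace, `!dst.empty()` inside `ImdecodeImpl`, suggests that one image could not be decoded after batch 620, for example because the file is missing, empty, or not an image. The sketch below is one hedged way to look for such entries before training. It assumes the `.lst` files follow the usual tab-separated layout (index, label, relative path) and that the images are JPEG, PNG, or BMP; the function name `find_suspect_images` and the header list are illustrative and are not part of MXNet or of `train.py`.

```python
import os

# Leading bytes of JPEG, PNG and BMP files (an assumed, incomplete list).
KNOWN_HEADERS = (b"\xff\xd8", b"\x89PNG", b"BM")


def find_suspect_images(lst_path, image_root):
    """Return (path, reason) pairs for list entries that look undecodable.

    Hypothetical helper: assumes each row is tab-separated and that the
    last field is the image path relative to image_root.
    """
    suspects = []
    with open(lst_path) as lst:
        for line in lst:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed rows
            path = os.path.join(image_root, fields[-1])
            if not os.path.isfile(path):
                suspects.append((path, "missing"))
                continue
            with open(path, "rb") as img:
                head = img.read(4)  # enough bytes to compare headers
            if not head:
                suspects.append((path, "empty file"))
            elif not head.startswith(KNOWN_HEADERS):
                suspects.append((path, "unrecognized header"))
    return suspects
```

With the paths from the command above, this would be called as `find_suspect_images("./lst_train.lst", "./data/")`. A truncated file with a valid header passes this check but can still fail to decode, so any file reported here is only a starting point for inspection.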