moveforever edited a comment on issue #18344: URL: https://github.com/apache/incubator-mxnet/issues/18344#issuecomment-629734656
## Description

I upgraded MXNet from 1.0 to 2.0. I loaded a model that was trained with 1.0 into MXNet 2.0, and when I train it, training stops after 30 batches and appears to hang.

### Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable `DMLC_LOG_STACK_TRACE_DEPTH=10` before running your script.)

There is no error output; the process just appears to hang.

```
+ export DMLC_LOG_STACK_TRACE_DEPTH=10
+ DMLC_LOG_STACK_TRACE_DEPTH=10
+ curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py
+ /thirdparty/bin/python3 src/train.py 20
INFO:root:['dense', 'cate', 'sparse', 'field_0', 'field_1', 'field_2', 'field_3', 'field_4', 'field_5', 'field_6']
INFO:root:only use embedding model
[10:37:49] src/base.cc:51: Upgrade advisory: this mxnet has been built against cuda library version 9010, which is older than the oldest version tested by CI (10000). Set MXNET_CUDA_LIB_CHECKING=0 to quiet this warning.
INFO:root:Training started ...
INFO:root:Epoch[0] Batch [0-10] Speed: 1817.71 samples/sec auc=0.500820 multi_log loss=10.058306 multi_mse=0.000000
INFO:root:batch=10, forward_backward=21ms, update=174ms, update_metric=6679ms, data=15703ms, total=22579ms
INFO:root:Epoch[0] Batch [10-20] Speed: 3873.38 samples/sec auc=0.513181 multi_log loss=7.157333 multi_mse=0.974069
INFO:root:batch=20, forward_backward=17ms, update=185ms, update_metric=4430ms, data=5941ms, total=10574ms
INFO:root:Epoch[0] Batch [20-30] Speed: 4769.79 samples/sec auc=0.539163 multi_log loss=4.971608 multi_mse=0.000000
INFO:root:batch=30, forward_backward=13ms, update=121ms, update_metric=3803ms, data=4648ms, total=8586ms
```

## To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)
It may be difficult to reproduce my problem. I implemented a DataIter in the MXNet C++ source code to support multi-storage and multi-label samples, and it runs well in MXNet. Each row is split by `^A` (`\001`) as follows: the first column is the label, the second column is the dense feature, the third column is the categorical feature, the fourth column is the multi-hot categorical feature, whose values are separated by commas, and the fifth column is the sparse feature, which supports wide input for Google's Wide & Deep model.

### Steps to reproduce

(Paste the commands you ran that produced the error.)

1.
2.

## What have you tried to solve it?

1.
2.

## Environment

cuda-9.0
gcc 8.4
centos 7.2
python 3.7

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

`curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python`

The command printed no information, and it also seems to hang.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
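
For readers trying to build a test input, the row layout described under "To Reproduce" could be parsed roughly as in the Python sketch below. This is only an illustration of the five-column, `\001`-delimited format, not the reporter's C++ DataIter. The names `Row` and `parse_row` and the sample values are hypothetical, and the internal encoding of the label, dense, categorical, and sparse columns is not stated in the issue, so those columns are kept as raw strings.

```python
from typing import List, NamedTuple

FIELD_SEP = "\x01"   # ^A, the column delimiter named in the issue
MULTI_HOT_SEP = ","  # separator inside the multi-hot column, per the issue


class Row(NamedTuple):
    """Hypothetical container for one sample; not part of MXNet."""
    label: str            # column 1, kept raw (encoding not given in the issue)
    dense: str            # column 2, kept raw
    categorical: str      # column 3, kept raw
    multi_hot: List[str]  # column 4, split on commas
    sparse: str           # column 5, kept raw


def parse_row(line: str) -> Row:
    """Split one text line into the five columns described in the issue."""
    columns = line.rstrip("\n").split(FIELD_SEP)
    if len(columns) != 5:
        raise ValueError(f"expected 5 columns, got {len(columns)}")
    label, dense, categorical, multi_hot, sparse = columns
    ids = multi_hot.split(MULTI_HOT_SEP) if multi_hot else []
    return Row(label, dense, categorical, ids, sparse)


# Made-up sample values, only to exercise the parser.
example = FIELD_SEP.join(["1", "0.5", "7", "3,9,12", "raw-sparse-payload"])
row = parse_row(example)
print(row.multi_hot)
```

A small file of such rows would give maintainers something concrete to feed a stand-in iterator when checking whether the hang is tied to data loading, which dominates the per-batch timings in the log above.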
