[GitHub] [incubator-mxnet] moveforever edited a comment on issue #18344: Failed to train model trained by mxnet 1.0 in mxnet 2.0

GitBox Sat, 16 May 2020 19:50:24 -0700


moveforever edited a comment on issue #18344:
URL: 
https://github.com/apache/incubator-mxnet/issues/18344#issuecomment-629734656



   ## Description
   I upgraded MXNet from 1.0 to 2.0. I loaded a model that was trained with 1.0 into MXNet 2.0, and when I train it, training stops after 30 batches and appears to hang.
   
   ### Error Message
   (Paste the complete error message. Please also include stack trace by 
setting environment variable `DMLC_LOG_STACK_TRACE_DEPTH=10` before running 
your script.)
   There is no error message; the process appears to hang.
   ```
   + export DMLC_LOG_STACK_TRACE_DEPTH=10
   + DMLC_LOG_STACK_TRACE_DEPTH=10
   + curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py
   + /thirdparty/bin/python3 src/train.py
   INFO:root:current path is: /cephfs/group/omg-omgmobile-app-tencent-sports/qomozheng/project/multi_obj_opt
   20
   INFO:root:['dense', 'cate', 'sparse', 'field_0', 'field_1', 'field_2', 'field_3', 'field_4', 'field_5', 'field_6']
   INFO:root:only use embedding model
   [10:37:49] src/base.cc:51: Upgrade advisory: this mxnet has been built against cuda library version 9010, which is older than the oldest version tested by CI (10000).  Set MXNET_CUDA_LIB_CHECKING=0 to quiet this warning.
   INFO:root:Training started ...
   INFO:root:Epoch[0] Batch [0-10]        Speed: 1817.71 samples/sec      auc=0.500820    multi_logloss=10.058306 multi_mse=0.000000
   INFO:root:batch=10, forward_backward=21ms, update=174ms, update_metric=6679ms, data=15703ms, total=22579ms
   INFO:root:Epoch[0] Batch [10-20]       Speed: 3873.38 samples/sec      auc=0.513181    multi_logloss=7.157333  multi_mse=0.974069
   INFO:root:batch=20, forward_backward=17ms, update=185ms, update_metric=4430ms, data=5941ms, total=10574ms
   INFO:root:Epoch[0] Batch [20-30]       Speed: 4769.79 samples/sec      auc=0.539163    multi_logloss=4.971608  multi_mse=0.000000
   INFO:root:batch=30, forward_backward=13ms, update=121ms, update_metric=3803ms, data=4648ms, total=8586ms
   ```
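
The per-batch timing lines in the log above already hint at where the time goes before the hang. As a rough sketch (the helper names `parse_timings` and `phase_shares` are made up for illustration and are not part of MXNet), the following parses those lines and computes each phase's share of the reported total:

```python
import re

# Timing lines copied from the log above, each rejoined onto a single line.
LOG = """\
INFO:root:batch=10, forward_backward=21ms, update=174ms, update_metric=6679ms, data=15703ms, total=22579ms
INFO:root:batch=20, forward_backward=17ms, update=185ms, update_metric=4430ms, data=5941ms, total=10574ms
INFO:root:batch=30, forward_backward=13ms, update=121ms, update_metric=3803ms, data=4648ms, total=8586ms
"""


def parse_timings(text):
    """Return one dict of integer fields per timing line (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        if "forward_backward=" not in line:
            continue  # skip lines that are not per-batch timing reports
        fields = re.findall(r"(\w+)=(\d+)(?:ms)?", line)
        rows.append({name: int(value) for name, value in fields})
    return rows


def phase_shares(row):
    """Return each phase's fraction of the reported total time."""
    total = row["total"]
    return {k: v / total for k, v in row.items() if k not in ("batch", "total")}


if __name__ == "__main__":
    for row in parse_timings(LOG):
        shares = phase_shares(row)
        summary = ", ".join(f"{k}={v:.1%}" for k, v in shares.items())
        print(f"batch={row['batch']}: {summary}")
```

In all three reports, `data` and `update_metric` together account for nearly all of the time, while `forward_backward` is negligible. That suggests, without proving it, that the data pipeline or the metric update is a reasonable place to start looking.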
   
   ## To Reproduce
   (If you developed your own code, please provide a short script that 
reproduces the error. For existing examples, please provide link.)
   
   ### Steps to reproduce
   (Paste the commands you ran that produced the error.)
   
   1.
   2.
   
   ## What have you tried to solve it?
   
   1.
   2.
   
   ## Environment
   cuda-9.0
   gcc 8.4
   centos 7.2
   python 3.7
   We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
   `curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python`
   The script prints no output and also appears to hang.
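
When a Python process hangs without printing anything, the standard-library `faulthandler` module can periodically dump the Python stack of every thread. This is a minimal sketch, assuming it is called near the top of `src/train.py`; the helper names and the 300-second default are illustrative choices, not part of MXNet:

```python
import faulthandler
import sys


def enable_hang_dumps(timeout_s=300, stream=None):
    """Dump all thread tracebacks every `timeout_s` seconds (hypothetical helper).

    `stream` must be a real file object with a file descriptor; it
    defaults to sys.stderr.
    """
    stream = sys.stderr if stream is None else stream
    # Also dump tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=stream)
    # A watchdog thread writes the tracebacks even if the main thread is stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_dumps():
    """Stop the periodic dumps and the fatal-signal handler."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()


if __name__ == "__main__":
    enable_hang_dumps(timeout_s=300)
    # ... run the training loop here ...
    disable_hang_dumps()
```

The dumps only show Python frames, so a hang inside native code shows up as the Python call that entered it. That is usually still enough to tell whether the process is stuck in data loading, the metric update, or somewhere else.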


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
