[GitHub] [incubator-mxnet] moveforever edited a comment on issue #18344: Failed to train model trained by mxnet 1.0 in mxnet 2.0

GitBox Tue, 19 May 2020 02:47:19 -0700


moveforever edited a comment on issue #18344:
URL: 
https://github.com/apache/incubator-mxnet/issues/18344#issuecomment-629734656



   ## Description
   I upgraded MXNet from 1.0 to 2.0. When I load a model that was trained with 
1.0 into MXNet 2.0 and train it, training stops after 30 batches and appears 
to hang.
   
   ### Error Message
   (Paste the complete error message. Please also include stack trace by 
setting environment variable `DMLC_LOG_STACK_TRACE_DEPTH=10` before running 
your script.)
   There is no error output; the process simply appears to hang.
   ```
   + export DMLC_LOG_STACK_TRACE_DEPTH=10
   + DMLC_LOG_STACK_TRACE_DEPTH=10
   + curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py
   + /thirdparty/bin/python3 src/train.py
   20
   INFO:root:['dense', 'cate', 'sparse', 'field_0', 'field_1', 'field_2', 'field_3', 'field_4', 'field_5', 'field_6']
   INFO:root:only use embedding model
   [10:37:49] src/base.cc:51: Upgrade advisory: this mxnet has been built against cuda library version 9010, which is older than the oldest version tested by CI (10000).  Set MXNET_CUDA_LIB_CHECKING=0 to quiet this warning.
   INFO:root:Training started ...
   INFO:root:Epoch[0] Batch [0-10]        Speed: 1817.71 samples/sec      auc=0.500820    multi_logloss=10.058306 multi_mse=0.000000
   INFO:root:batch=10, forward_backward=21ms, update=174ms, update_metric=6679ms, data=15703ms, total=22579ms
   INFO:root:Epoch[0] Batch [10-20]       Speed: 3873.38 samples/sec      auc=0.513181    multi_logloss=7.157333  multi_mse=0.974069
   INFO:root:batch=20, forward_backward=17ms, update=185ms, update_metric=4430ms, data=5941ms, total=10574ms
   INFO:root:Epoch[0] Batch [20-30]       Speed: 4769.79 samples/sec      auc=0.539163    multi_logloss=4.971608  multi_mse=0.000000
   INFO:root:batch=30, forward_backward=13ms, update=121ms, update_metric=3803ms, data=4648ms, total=8586ms
   ```
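
Since nothing is printed when training stalls, one way to see where the Python side is blocked is the standard-library `faulthandler` module. The following is only a minimal sketch, assuming it is placed near the top of `src/train.py`; `enable_hang_dump` and `disable_hang_dump` are hypothetical helper names, not part of MXNet. It shows Python frames only, so a hang inside the C++ DataIter would show up as the Python call that entered it.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=60.0, out=sys.stderr):
    """Dump the stacks of all Python threads every `timeout_s` seconds.

    `out` must be a real file object (it needs a file descriptor) and must
    stay open until disable_hang_dump() is called.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def disable_hang_dump():
    """Stop the periodic stack dumps started by enable_hang_dump()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump(60)` before the training loop would then print a traceback once per minute, so the last dump before the process is killed shows the call it was stuck in.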
   
   ## To Reproduce
   (If you developed your own code, please provide a short script that 
reproduces the error. For existing examples, please provide link.)
   
   It may be difficult to reproduce my problem.
   I implemented a DataIter in the MXNet C++ source code to support 
multi-storage and multi-label samples as follows, and it runs well in MXNet. 
   Each row is split by ^A (\001) as follows. The first column is the label, 
the second column is the dense feature, the third column is the categorical 
feature, the fourth column is a multi-hot categorical feature whose values are 
separated by commas, and the fifth column is the sparse feature, which 
supports the wide input of Google's Wide & Deep model.
   
![image](https://user-images.githubusercontent.com/5248288/82134858-f84dfe80-982e-11ea-9dd3-cf442c5640c2.png)
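
To make the row layout concrete, here is a minimal Python sketch of splitting one such row. It is not the C++ DataIter itself. The names `Row` and `parse_row` are hypothetical, and because the internal encoding of the label, dense, categorical, and sparse columns is not spelled out in the text, those columns are kept as raw strings; only the ^A column split and the comma split of the multi-hot column come from the description.

```python
from typing import List, NamedTuple

FIELD_SEP = "\x01"  # ^A (\001), the column separator


class Row(NamedTuple):
    label: str            # column 1, kept as a raw string (encoding assumed unknown)
    dense: str            # column 2, raw dense feature string
    categorical: str      # column 3, raw categorical feature string
    multi_hot: List[str]  # column 4, split on commas
    sparse: str           # column 5, raw sparse (wide) feature string


def parse_row(line: str) -> Row:
    """Split one ^A-delimited line into its five columns."""
    parts = line.rstrip("\n").split(FIELD_SEP)
    if len(parts) != 5:
        raise ValueError(f"expected 5 columns, got {len(parts)}")
    label, dense, categorical, multi_hot, sparse = parts
    values = multi_hot.split(",") if multi_hot else []
    return Row(label, dense, categorical, values, sparse)
```

For example, `parse_row("1\x010.5\x013\x017,8,9\x0142:1\n")` (made-up values) yields a `Row` whose `multi_hot` field is `["7", "8", "9"]`.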
   ### Steps to reproduce
   (Paste the commands you ran that produced the error.)
   1.
   2.
   
   ## What have you tried to solve it?
   
   1.
   2.
   
   ## Environment
   cuda-9.0
   gcc 8.4
   centos 7.2
   python 3.7
   We recommend using our script for collecting the diagnostic information. Run 
the following command and paste the outputs below:
   `curl --retry 10 -s 
https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | 
python`
   There is no output, and it appears to hang.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
