chrishkchris edited a comment on pull request #697: URL: https://github.com/apache/singa/pull/697#issuecomment-637375715
I am using this PR to train Xceptionnet in order to use the save_state function, but I encountered something strange:

(i) The training and evaluation were both okay in https://github.com/apache/singa/pull/651

```
(singa) dcsysh@panda7:~/singa/examples/autograd$ python3 train.py xceptionnet ci
Starting Epoch 0:
Training loss = 11198.645508, training accuracy = 0.214420
Evaluation accuracy = 0.309000, Elapsed Time = 606.547117s
Starting Epoch 1:
Training loss = 6354.611328, training accuracy = 0.381020
Evaluation accuracy = 0.457300, Elapsed Time = 612.817129s
```

(ii) This time I think the training is okay, but something is wrong in the evaluation: the evaluation accuracy stays at about 0.10 for all 30 epochs while the training accuracy rises from 0.13 to 0.77

```
root@e8a757397ca3:~/dcsysh/singa/examples/cnn# mpiexec -np 8 python3 train_mpi.py xceptionnet cifar10 --bs 16 --lr 0.04 --epoch 30
Starting Epoch 0:
Training loss = 11614.897461, training accuracy = 0.131190
Evaluation accuracy = 0.099860, Elapsed Time = 98.705291s
Starting Epoch 1:
Training loss = 6932.552246, training accuracy = 0.157552
Evaluation accuracy = 0.099860, Elapsed Time = 98.400360s
Starting Epoch 2:
Training loss = 6565.343262, training accuracy = 0.195853
Evaluation accuracy = 0.099960, Elapsed Time = 99.807898s
Starting Epoch 3:
Training loss = 6173.305176, training accuracy = 0.254467
Evaluation accuracy = 0.099960, Elapsed Time = 99.759293s
Starting Epoch 4:
Training loss = 5841.223633, training accuracy = 0.306430
Evaluation accuracy = 0.099960, Elapsed Time = 99.962356s
Starting Epoch 5:
Training loss = 5526.505859, training accuracy = 0.350821
Evaluation accuracy = 0.100060, Elapsed Time = 100.282988s
Starting Epoch 6:
Training loss = 5319.209473, training accuracy = 0.376542
Evaluation accuracy = 0.100060, Elapsed Time = 99.520091s
Starting Epoch 7:
Training loss = 5106.029297, training accuracy = 0.402684
Evaluation accuracy = 0.100060, Elapsed Time = 99.491482s
Starting Epoch 8:
Training loss = 4916.409180, training accuracy = 0.424820
Evaluation accuracy = 0.100060, Elapsed Time = 99.767488s
Starting Epoch 9:
Training loss = 4734.987793, training accuracy = 0.446054
Evaluation accuracy = 0.100060, Elapsed Time = 99.660972s
Starting Epoch 10:
Training loss = 4584.931641, training accuracy = 0.465365
Evaluation accuracy = 0.100060, Elapsed Time = 100.107028s
Starting Epoch 11:
Training loss = 4360.736816, training accuracy = 0.492748
Evaluation accuracy = 0.100060, Elapsed Time = 99.807331s
Starting Epoch 12:
Training loss = 4216.152344, training accuracy = 0.514243
Evaluation accuracy = 0.100060, Elapsed Time = 99.772958s
Starting Epoch 13:
Training loss = 4064.178955, training accuracy = 0.532192
Evaluation accuracy = 0.100060, Elapsed Time = 100.053775s
Starting Epoch 14:
Training loss = 3899.273926, training accuracy = 0.550962
Evaluation accuracy = 0.100060, Elapsed Time = 106.455404s
Starting Epoch 15:
Training loss = 3733.515137, training accuracy = 0.576242
Evaluation accuracy = 0.100060, Elapsed Time = 102.990761s
Starting Epoch 16:
Training loss = 3591.209961, training accuracy = 0.592167
Evaluation accuracy = 0.100060, Elapsed Time = 100.279051s
Starting Epoch 17:
Training loss = 3453.231201, training accuracy = 0.608454
Evaluation accuracy = 0.100060, Elapsed Time = 100.323891s
Starting Epoch 18:
Training loss = 3293.441406, training accuracy = 0.625942
Evaluation accuracy = 0.100060, Elapsed Time = 100.243008s
Starting Epoch 19:
Training loss = 3145.550293, training accuracy = 0.644231
Evaluation accuracy = 0.100060, Elapsed Time = 100.145333s
Starting Epoch 20:
Training loss = 3018.382568, training accuracy = 0.659976
Evaluation accuracy = 0.100060, Elapsed Time = 99.985306s
Starting Epoch 21:
Training loss = 2867.048828, training accuracy = 0.677083
Evaluation accuracy = 0.100060, Elapsed Time = 100.097360s
Starting Epoch 22:
Training loss = 2743.534424, training accuracy = 0.689784
Evaluation accuracy = 0.100060, Elapsed Time = 99.774135s
Starting Epoch 23:
Training loss = 2646.668457, training accuracy = 0.703105
Evaluation accuracy = 0.100060, Elapsed Time = 99.958771s
Starting Epoch 24:
Training loss = 2525.976562, training accuracy = 0.717468
Evaluation accuracy = 0.100060, Elapsed Time = 99.577777s
Starting Epoch 25:
Training loss = 2429.261230, training accuracy = 0.729988
Evaluation accuracy = 0.100060, Elapsed Time = 100.078185s
Starting Epoch 26:
Training loss = 2350.896484, training accuracy = 0.739203
Evaluation accuracy = 0.100060, Elapsed Time = 100.012700s
Starting Epoch 27:
Training loss = 2255.607666, training accuracy = 0.748598
Evaluation accuracy = 0.100060, Elapsed Time = 99.678916s
Starting Epoch 28:
Training loss = 2199.779541, training accuracy = 0.753686
Evaluation accuracy = 0.100060, Elapsed Time = 100.552001s
Starting Epoch 29:
Training loss = 2120.205566, training accuracy = 0.765725
Evaluation accuracy = 0.099960, Elapsed Time = 100.228618s
```
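For anyone who wants to check a run for the same symptom, here is a minimal sketch that parses the log format shown in this comment and flags an evaluation accuracy that never leaves chance level. The helper names `parse_epochs` and `eval_stuck_at_chance` and the tolerance `tol` are hypothetical and not part of SINGA; `num_classes=10` assumes CIFAR-10, where random guessing gives an accuracy of about 0.10.

```python
import re

# Pattern for one epoch record in the log format quoted in this comment.
# \s* lets it match whether the record is on one line or split over three.
EPOCH_RE = re.compile(
    r"Starting Epoch (\d+):\s*"
    r"Training loss = ([\d.]+), training accuracy = ([\d.]+)\s*"
    r"Evaluation accuracy = ([\d.]+), Elapsed Time = ([\d.]+)s"
)


def parse_epochs(log_text):
    """Return a list of (epoch, train_acc, eval_acc) tuples found in log_text."""
    return [
        (int(m.group(1)), float(m.group(3)), float(m.group(4)))
        for m in EPOCH_RE.finditer(log_text)
    ]


def eval_stuck_at_chance(records, num_classes=10, tol=0.01):
    """True if every evaluation accuracy is within tol of 1/num_classes.

    num_classes=10 is an assumption for CIFAR-10; tol is an arbitrary
    illustrative threshold.
    """
    chance = 1.0 / num_classes
    return bool(records) and all(abs(ev - chance) <= tol for _, _, ev in records)


if __name__ == "__main__":
    # Two records copied from the second run in this comment.
    sample = """
    Starting Epoch 0:
    Training loss = 11614.897461, training accuracy = 0.131190
    Evaluation accuracy = 0.099860, Elapsed Time = 98.705291s
    Starting Epoch 29:
    Training loss = 2120.205566, training accuracy = 0.765725
    Evaluation accuracy = 0.099960, Elapsed Time = 100.228618s
    """
    records = parse_epochs(sample)
    print(records)
    print(eval_stuck_at_chance(records))
```

A training accuracy that keeps improving while the check returns true suggests the problem is in the evaluation path rather than in the optimization, but the sketch only detects the symptom and says nothing about its cause.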
