chrishkchris opened a new pull request #762: URL: https://github.com/apache/singa/pull/762
This fixes an error in the training loss for distributed training. In the dev branch, distributed training no longer reproduces the expected loss values that I have relied on for a long time.

Before fix:
```
root@64926e30597f:~/dcsysh/singa/examples/cnn# mpiexec -np 3 python3 train_mpi.py cnn mnist -l 0.015
Starting Epoch 0:
Training loss = 867.269531, training accuracy = 0.682409
Evaluation accuracy = 0.913662, Elapsed Time = 1.374367s
Starting Epoch 1:
Training loss = 312.582123, training accuracy = 0.893546
Evaluation accuracy = 0.946014, Elapsed Time = 1.324747s
Starting Epoch 2:
Training loss = 223.973038, training accuracy = 0.924312
Evaluation accuracy = 0.955629, Elapsed Time = 1.325152s
Starting Epoch 3:
Training loss = 176.310730, training accuracy = 0.939804
Evaluation accuracy = 0.965645, Elapsed Time = 1.327019s
Starting Epoch 4:
Training loss = 146.806168, training accuracy = 0.950220
Evaluation accuracy = 0.969451, Elapsed Time = 1.320603s
Starting Epoch 5:
Training loss = 124.658463, training accuracy = 0.958784
Evaluation accuracy = 0.970653, Elapsed Time = 1.317975s
Starting Epoch 6:
Training loss = 112.322250, training accuracy = 0.962724
Evaluation accuracy = 0.972857, Elapsed Time = 1.343767s
Starting Epoch 7:
Training loss = 102.903122, training accuracy = 0.965044
Evaluation accuracy = 0.971254, Elapsed Time = 1.316032s
Starting Epoch 8:
Training loss = 96.206215, training accuracy = 0.967798
Evaluation accuracy = 0.971354, Elapsed Time = 1.292748s
Starting Epoch 9:
Training loss = 90.059357, training accuracy = 0.969785
Evaluation accuracy = 0.981170, Elapsed Time = 1.301958s
```

After fix:
```
root@64926e30597f:~/dcsysh/singa/examples/cnn# mpiexec -np 3 python3 train_mpi.py cnn mnist -l 0.015
Starting Epoch 0:
Training loss = 653.234863, training accuracy = 0.767194
Evaluation accuracy = 0.936498, Elapsed Time = 1.364626s
Starting Epoch 1:
Training loss = 245.488037, training accuracy = 0.917201
Evaluation accuracy = 0.959435, Elapsed Time = 1.311175s
Starting Epoch 2:
Training loss = 174.001266, training accuracy = 0.941757
Evaluation accuracy = 0.959736, Elapsed Time = 1.324813s
Starting Epoch 3:
Training loss = 141.203125, training accuracy = 0.953292
Evaluation accuracy = 0.971054, Elapsed Time = 1.330215s
Starting Epoch 4:
Training loss = 119.192688, training accuracy = 0.959519
Evaluation accuracy = 0.973758, Elapsed Time = 1.302892s
Starting Epoch 5:
Training loss = 107.171661, training accuracy = 0.964443
Evaluation accuracy = 0.975761, Elapsed Time = 1.314337s
Starting Epoch 6:
Training loss = 97.575897, training accuracy = 0.966513
Evaluation accuracy = 0.977764, Elapsed Time = 1.304296s
Starting Epoch 7:
Training loss = 89.828827, training accuracy = 0.970753
Evaluation accuracy = 0.975561, Elapsed Time = 1.316111s
Starting Epoch 8:
Training loss = 84.263199, training accuracy = 0.972189
Evaluation accuracy = 0.979868, Elapsed Time = 1.298452s
Starting Epoch 9:
Training loss = 78.318733, training accuracy = 0.974059
Evaluation accuracy = 0.981370, Elapsed Time = 1.308062s
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
