chrishkchris opened a new pull request #787: URL: https://github.com/apache/singa/pull/787
Thanks Rulin @XJDKC for fixing the graph operation in this PR.

The problem was caused by the workspace variable in rnn. It appears when ops clear the workspace variable (setValue->0) before it is used by the rnn operation, and this happens multiple times in the graph, while some of the ops using the workspace are independent of each other.

To solve the problem, if a tensor is written by multiple independent ops, these ops should be executed sequentially, in time order.

Some test results:

```
root@64926e30597f:~/dcsysh/singa/examples/rnn# python3 imdb_train.py
epoch 0 loss [0.6489457]; acc 0.617
epoch 1 loss [0.55472153]; acc 0.715
epoch 2 loss [0.51863945]; acc 0.743
epoch 3 loss [0.49822766]; acc 0.758
epoch 4 loss [0.48312518]; acc 0.767
eval acc 0.750

root@64926e30597f:~/dcsysh/singa/examples/cnn# python3 train_cnn.py resnet cifar10 -b 32 -m 1
Starting Epoch 0:
Training loss = 2867.570801, training accuracy = 0.352753
Evaluation accuracy = 0.467448, Elapsed Time = 335.571664s

root@64926e30597f:~/dcsysh/singa/examples/cnn# python3 train_cnn.py resnet cifar10 -b 32 -m 1 -g
Starting Epoch 0:
Training loss = 2866.714111, training accuracy = 0.352693
Evaluation accuracy = 0.488381, Elapsed Time = 379.508079s
```
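
To illustrate the rule about ops that write the same tensor, here is a minimal Python sketch. It is not the SINGA scheduler code; `build_edges`, the op names, and the tensor names are all hypothetical. It only shows how adding a write-after-write dependency forces otherwise independent ops that touch a shared workspace to run in the order they were added.

```python
from collections import defaultdict


def build_edges(ops):
    """Build dependency edges between ops (illustrative only, not SINGA code).

    ops: list of (name, reads, writes) tuples in the order they were added.
    Returns a dict mapping an op index to the set of op indices that must
    run after it.
    """
    last_writer = {}                          # tensor -> index of last op that wrote it
    readers_since_write = defaultdict(list)   # tensor -> ops that read it since that write
    edges = defaultdict(set)

    for idx, (_name, reads, writes) in enumerate(ops):
        for t in reads:
            # Read-after-write: the reader waits for the last writer.
            if t in last_writer and last_writer[t] != idx:
                edges[last_writer[t]].add(idx)
            readers_since_write[t].append(idx)
        for t in writes:
            # Write-after-write: two writers of the same tensor are ordered
            # by time, even if they share no other data.
            if t in last_writer and last_writer[t] != idx:
                edges[last_writer[t]].add(idx)
            # Write-after-read: do not overwrite a tensor still being read.
            for r in readers_since_write[t]:
                if r != idx:
                    edges[r].add(idx)
            last_writer[t] = idx
            readers_since_write[t] = []
    return edges


# Hypothetical ops: two rnn calls that share one workspace tensor and are
# otherwise independent (different inputs and outputs).
ops = [
    ("clear_ws_a", [], ["workspace"]),
    ("rnn_a", ["x_a", "workspace"], ["y_a", "workspace"]),
    ("clear_ws_b", [], ["workspace"]),
    ("rnn_b", ["x_b", "workspace"], ["y_b", "workspace"]),
]

for src, dsts in sorted(build_edges(ops).items()):
    for dst in sorted(dsts):
        print(f"{ops[src][0]} -> {ops[dst][0]}")
```

In this sketch the second clear of the workspace depends on `rnn_a`, so it cannot zero the workspace while the first rnn op is still using it. If only read-after-write dependencies were tracked, `clear_ws_b` would have no predecessor and could run at any point.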
