chrishkchris opened a new pull request #566: SINGA-487 Add Sparsification Algorithm: Threshold Quantization URL: https://github.com/apache/singa/pull/566

This PR implements a simple sparsification scheme: only gradient values whose magnitude exceeds an absolute threshold are transferred. Because the dense-to-sparse conversion uses the CUDA Thrust parallel algorithms, the overhead is relatively low.

Reference papers for the sparsification:

[1] N. Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proceedings of InterSpeech 2015. International Speech Communication Association (ISCA), September 2015.

[2] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 440-445. Association for Computational Linguistics (ACL), September 2017.

I have added an example file, sparsification_mnist.py, to test the accuracy. The following results are from a 4-GPU AWS g4dn.12xlarge instance (NVIDIA T4 GPUs).
```
ubuntu@ip-172-31-20-160:~/singa/examples/autograd$ python3 sparsification_mnist.py
Starting Epoch 0:
Training loss = 809.631958, training accuracy = 0.709352
Evaluation accuracy = 0.905849, Elapsed Time = 1.251285s
Starting Epoch 1:
Training loss = 325.436279, training accuracy = 0.888906
Evaluation accuracy = 0.936098, Elapsed Time = 0.882350s
Starting Epoch 2:
Training loss = 238.643738, training accuracy = 0.920106
Evaluation accuracy = 0.952424, Elapsed Time = 0.847908s
Starting Epoch 3:
Training loss = 200.181030, training accuracy = 0.933377
Evaluation accuracy = 0.947616, Elapsed Time = 0.839072s
Starting Epoch 4:
Training loss = 182.340820, training accuracy = 0.938969
Evaluation accuracy = 0.962240, Elapsed Time = 0.836915s
Starting Epoch 5:
Training loss = 161.267120, training accuracy = 0.946615
Evaluation accuracy = 0.970653, Elapsed Time = 0.839940s
Starting Epoch 6:
Training loss = 147.990921, training accuracy = 0.951356
Evaluation accuracy = 0.970753, Elapsed Time = 0.842795s
Starting Epoch 7:
Training loss = 139.301285, training accuracy = 0.953626
Evaluation accuracy = 0.973458, Elapsed Time = 0.842011s
Starting Epoch 8:
Training loss = 131.042053, training accuracy = 0.956564
Evaluation accuracy = 0.963241, Elapsed Time = 0.840951s
Starting Epoch 9:
Training loss = 126.376511, training accuracy = 0.957732
Evaluation accuracy = 0.967448, Elapsed Time = 0.841526s
```
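The threshold scheme described above can be sketched in host-side NumPy (the actual PR performs this on the GPU with CUDA Thrust; the function names `sparsify` and `desparsify` here are hypothetical, not SINGA's API):

```python
import numpy as np

def sparsify(grad, threshold):
    """Keep only gradient entries whose magnitude exceeds the threshold.

    Returns (indices, values) pairs, which are what would actually be
    transmitted instead of the full dense gradient tensor.
    """
    flat = grad.ravel()
    idx = np.flatnonzero(np.abs(flat) > threshold)
    return idx, flat[idx]

def desparsify(idx, values, shape):
    """Scatter the received (index, value) pairs back into a dense tensor,
    with all suppressed entries restored as zeros."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

grad = np.array([[0.02, -0.5], [1.3, -0.001]])
idx, vals = sparsify(grad, threshold=0.1)
# Only -0.5 and 1.3 exceed the threshold in magnitude,
# so just two (index, value) pairs are sent instead of four floats.
restored = desparsify(idx, vals, grad.shape)
```

On the GPU the index/value extraction step maps naturally onto a parallel stream-compaction primitive (e.g. Thrust's `copy_if`), which is why the conversion overhead stays low.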
