chrishkchris opened a new pull request #566: SINGA-487 Add Sparsification 
Algorithms
URL: https://github.com/apache/singa/pull/566
 
 
   This PR implements several sparsification schemes that transmit only the significant gradient elements. Because the CUDA Thrust parallel algorithms are used to convert the dense array into a sparse array, the conversion overhead is relatively low.
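   
   For illustration, a minimal sketch of the dense-to-sparse conversion idea is given below. The actual implementation performs this step on the GPU with Thrust; the function name dense_to_sparse and its parameters are hypothetical, chosen only to illustrate the scheme:
   
   ```python
   import numpy as np

   def dense_to_sparse(grad, threshold):
       # Keep only the gradient elements whose magnitude exceeds the threshold;
       # the sparse representation is a pair of (indices, values) arrays.
       idx = np.nonzero(np.abs(grad) > threshold)[0]
       return idx, grad[idx]

   # Example: only the two significant elements are transmitted.
   grad = np.array([0.01, -0.5, 0.02, 0.7, -0.03], dtype=np.float32)
   indices, values = dense_to_sparse(grad, threshold=0.1)
   # indices -> [1, 3], values -> [-0.5, 0.7]
   ```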
   
   It supports two modes, controlled by the flag topK:
   1. When topK is False, it transmits the gradient elements that are greater than an absolute threshold value.
   2. When topK is True, it transmits the K largest gradient elements, where K equals the total number of elements multiplied by the sparsity factor.
   
   Moreover, there is a flag corr that uses the locally accumulated gradient for correction. The flag is True by default, because local accumulated-gradient correction is commonly used together with sparsification.
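   
   A rough sketch of the two selection modes and the local-accumulation correction, written in NumPy, is shown below. The flag names topK and corr and the sparsity factor spars follow the description above, while the function sparsify, the residual argument, and the threshold parameter are hypothetical illustrations rather than the actual SINGA API:
   
   ```python
   import numpy as np

   def sparsify(grad, residual, threshold=0.05, spars=0.05, topK=False, corr=True):
       # corr: add the locally accumulated (residual) gradient before selection.
       g = grad + residual if corr else grad

       if topK:
           # topK mode: keep the K largest elements by magnitude,
           # where K = total number of elements * sparsity factor.
           k = max(1, int(g.size * spars))
           idx = np.argsort(np.abs(g))[-k:]
       else:
           # Threshold mode: keep elements above an absolute threshold
           # (magnitude comparison assumed here).
           idx = np.nonzero(np.abs(g) > threshold)[0]

       # Elements that are not transmitted stay in the residual for future correction.
       new_residual = g.copy()
       new_residual[idx] = 0.0
       return idx, g[idx], new_residual
   ```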
   
   Some reference papers for the sparsification:
   [1] N. Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proceedings of InterSpeech 2015. International Speech Communication Association (ISCA), September 2015.
   [2] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 440-445. Association for Computational Linguistics (ACL), September 2017.
   
   I have added an example file, sparsification_mnist.py, to test the accuracy. The following results are based on an 8-GPU AWS p2.8xlarge instance with NVIDIA K80 GPUs.
   
   ```
   ubuntu@ip-172-31-18-216:~/singa/examples/autograd$ python3 
sparsification_mnist.py
   Starting Epoch 0:
   Training loss = 1237.824951, training accuracy = 0.537627
   Evaluation accuracy = 0.831209, Elapsed Time = 1.364238s
   Starting Epoch 1:
   Training loss = 468.859161, training accuracy = 0.835053
   Evaluation accuracy = 0.931229, Elapsed Time = 0.687484s
   Starting Epoch 2:
   Training loss = 329.488220, training accuracy = 0.887604
   Evaluation accuracy = 0.949424, Elapsed Time = 0.713595s
   Starting Epoch 3:
   Training loss = 220.463303, training accuracy = 0.925731
   Evaluation accuracy = 0.955592, Elapsed Time = 0.686450s
   Starting Epoch 4:
   Training loss = 171.178146, training accuracy = 0.942141
   Evaluation accuracy = 0.961760, Elapsed Time = 0.686534s
   Starting Epoch 5:
   Training loss = 149.635681, training accuracy = 0.950237
   Evaluation accuracy = 0.974198, Elapsed Time = 0.686791s
   Starting Epoch 6:
   Training loss = 124.092453, training accuracy = 0.958300
   Evaluation accuracy = 0.973376, Elapsed Time = 0.686136s
   Starting Epoch 7:
   Training loss = 115.288582, training accuracy = 0.961205
   Evaluation accuracy = 0.968647, Elapsed Time = 0.686174s
   Starting Epoch 8:
   Training loss = 99.048584, training accuracy = 0.966864
   Evaluation accuracy = 0.981188, Elapsed Time = 0.685848s
   Starting Epoch 9:
   Training loss = 84.038574, training accuracy = 0.972239
   Evaluation accuracy = 0.981188, Elapsed Time = 0.685568s
   ```
