rahul003 opened a new pull request #8342: [WIP] 2bit gradient compression
URL: https://github.com/apache/incubator-mxnet/pull/8342
 
 
   ## Description ##
   Implements 2-bit gradient compression by quantizing each value in the gradient array to 2 bits, using two user-specified thresholds: one for positive and one for negative values.
   
   @eric-haibin-lin @piiswrong @reminisce @anirudh2290 @bhavinthaker @madjam 
@cjolivier01 
   Please review. This is a work in progress. I'm currently running this with different kinds of models to gather performance results.
   
   ### Important files to review
   Operator
   - two_bit_quantize-inl.h
   - two_bit_quantize.cc
   
   KVStore local
   - comm.h
   
   KVStore dist
   - kvstore_dist.h
   - kvstore_dist_server.h
   
   Documentation about gradient compression
   - kvstore.py
   - two_bit_quantize.cc
   
   ## Checklist ##
   ### Essentials ###
   - [ ] Passed code style checking (`make lint`)
   - [ ] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage
   - [ ] For user-facing API changes, API doc string has been updated.
   - [ ] To my best knowledge, examples are either not affected by this change, 
or have been fixed to be compatible with this change
   
   ### Changes ###
   - [ ] two-bit-quantize and dequantize operators
   - [ ] Reduce operation in kvstore_local / comm.h
   - [ ] Distributed kvstore changes at worker and server
   - [ ] Tests for the operator, local kvstore, and distributed kvstore with predefined and random data. The results have been compared against expected values computed by implementing the same logic in Python.
   - [ ] API changes for KVStore, Module and Trainer in Python
   
   ## Comments ##
   ### Problem 
   When training large-scale deep learning models, especially with distributed training, communication becomes a bottleneck for networks whose computation cost is low relative to their communication cost.
   
   ### Approach
   We can try to address this by quantizing the gradients before sending them and dequantizing them at the receiver's end. The sender retains the quantization error and adds it to the gradient in the next iteration, effectively delaying small updates to positions in the gradient. This PR currently implements 2-bit quantization.
   
   ### Two bit quantization
   Use two thresholds to quantize the data: one positive threshold and one negative threshold. Any positive value greater than or equal to the positive threshold is set to one value (say 01), any negative value less than or equal to the negative threshold is set to a second value (say 10), and all other values are set to a third value (say 00). We need three values to represent data in this fashion and hence two bits. We understand this leads to one bit going to waste, but that's an optimization for later, as it complicates the operators. The error introduced by quantization is stored as a residual and carried over to the next iteration, where it is added to the gradient before quantizing (a minimal sketch of this logic follows the example below).
   An example below with thresholds of -2.0 and 2.0:
   ![Quantization at work](https://i.imgur.com/AtBVg92.png)
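   To make the scheme concrete, here is a minimal NumPy sketch of the quantization step with error feedback. It is illustrative only and not the actual C++ operator: the helper name and the sample gradient values are made up, but the thresholds and codes (01 for positive, 10 for negative, 00 otherwise) follow the description above.
   ```python
   import numpy as np

   def two_bit_quantize(grad, residual, neg_threshold=-2.0, pos_threshold=2.0):
       """Map grad to codes {0, 1, 2} and keep the quantization error in residual."""
       g = grad + residual                          # carry over error from the last step
       codes = np.zeros_like(g, dtype=np.uint8)     # 00 -> value between the thresholds
       codes[g >= pos_threshold] = 1                # 01 -> at or above the positive threshold
       codes[g <= neg_threshold] = 2                # 10 -> at or below the negative threshold
       # the receiver reconstructs each element as one of {pos_threshold, neg_threshold, 0}
       dequantized = np.where(codes == 1, pos_threshold,
                              np.where(codes == 2, neg_threshold, 0.0))
       residual[:] = g - dequantized                # store the error for the next iteration
       return codes

   grad = np.array([1.0, 3.0, 0.5, -4.0, 2.0, -0.5, 1.5, -2.0])
   residual = np.zeros_like(grad)
   print(two_bit_quantize(grad, residual))          # [0 1 0 2 1 0 0 2]
   ```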
   
   ### Format of compressed gradient
   The first two elements are the thresholds used for quantization, and the third element is the size of the original array. These values are required to dequantize the gradient. The elements from the 4th position onward hold the compressed gradient; each of them packs up to 16 elements of the original array. For the example above, we get
   ```compr = [ -2.0, 2.0, 8, 6.1215606E-28]```
   Note that the binary representation of the last element is 
   ```00 01 00 10 01 00 00 10  0000000000000000```
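   For illustration, here is a rough sketch of how 16 two-bit codes could be packed into a single float32 to reproduce the value above. The actual packing happens inside the C++ operator; the helper below is only a hypothetical stand-in.
   ```python
   import struct

   def pack_codes(codes):
       """Pack up to 16 two-bit codes (first element in the highest bits) into one float32."""
       bits = 0
       for i, c in enumerate(codes):
           bits |= (c & 0b11) << (30 - 2 * i)       # 2 bits per original element
       # reinterpret the 32-bit pattern as a float
       return struct.unpack('>f', struct.pack('>I', bits))[0]

   print(pack_codes([0, 1, 0, 2, 1, 0, 0, 2]))      # ~6.1215606e-28, matching the example
   ```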
   
   ### Local kvstore
   When using the local kvstore, gradient compression only happens when device communication is used. When gradients are pushed, quantization and dequantization happen before the gradients are summed up (Reduce).
   Example: Say we have 4 GPUs, and the gradients are being summed up on GPU0. Each device quantizes its gradient, then sends the quantized gradient to GPU0, which dequantizes the data before merging it with the values from the other GPUs. Note that there is no strict need to quantize the gradient from GPU0 itself, but it is still done so that there is no bias towards the samples processed by GPU0. **Please let me know if this is not a good idea.** A small simulation of this path is sketched below.
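   The following is a self-contained NumPy simulation of that reduce path, just to make the behaviour explicit. The function and variable names are made up and the thresholds are arbitrary; the real path goes through comm.h and the 2-bit operators.
   ```python
   import numpy as np

   def quantize_dequantize(grad, residual, neg=-0.5, pos=0.5):
       """Round-trip one gradient through 2-bit compression, updating its residual."""
       g = grad + residual
       out = np.where(g >= pos, pos, np.where(g <= neg, neg, 0.0))
       residual[:] = g - out
       return out

   n_gpus, size = 4, 6
   grads = [np.random.randn(size).astype(np.float32) for _ in range(n_gpus)]
   residuals = [np.zeros(size, dtype=np.float32) for _ in range(n_gpus)]
   # every device, including the one doing the reduce, compresses its gradient first
   reduced = sum(quantize_dequantize(g, r) for g, r in zip(grads, residuals))
   ```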
   
   ### Dist kvstore
   When the kvstore's set_compress method is called, each worker stores these compression parameters and one worker sends them to all servers. From then on, each value is quantized before it is pushed to the server. The server dequantizes the data and stores it as an array of the original size. When values are pulled from the server, it returns an array of the original size. The same applies when each server handles a shard of the data.
   
   ### Usage
   The reason I used a dictionary, compress_params, for the arguments is to keep the interface uniform when we extend this to other quantization techniques, since each technique would take a different number and type of parameters.
   #### Operators 
   ```
   compr = mx.nd.contrib.create_2bit(grad)
   mx.nd.contrib.quantize_2bit(grad, residual, compr, neg_threshold, 
pos_threshold)
   mx.nd.contrib.dequantize_2bit(compr, decompr)
   ```
   #### KVstore
   ```
   kv = mx.kv.create('dist_sync')
   kv.set_compress({'compress':'2bit', 'pos_threshold':0.5, 
'neg_threshold':-0.5})
   ```
   #### Module
   ```
   mod = mx.mod.Module(net, compress_params={'compress':'2bit', 
'pos_threshold':0.5, 'neg_threshold':-0.5})
   ```
   #### Gluon Trainer
   ```
   trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1}, 
                           compress_params={'compress':'2bit', 
'pos_threshold':0.5, 'neg_threshold':-0.5})
   ```
   
   ### Questions
   1. [Refer to the local kvstore section above] Should I skip quantization and dequantization of the gradient on the GPU which performs the reduction?
   
   2. When running nvprof for the GPU operator with large arrays, most of the time is taken by MapPlanKernel and cudaStreamSynchronize, as shown below. For small arrays, the quantize and dequantize operations are at the top. Any suggestions to reduce the time taken by these operations in the operator?
   ```
   Time(%)      Time     Calls       Avg       Min       Max  Name
    48.17%  11.9536s     50002  239.06us  223.23us  357.70us  void 
mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, 
mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, 
mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, 
unsigned int, mshadow::Shape<int=2>, int=1)
   ...
   
   ==24796== API calls:
   Time(%)      Time     Calls       Avg       Min       Max  Name
    63.81%  26.3938s    200002  131.97us  4.2860us  6.0096ms  
cudaStreamSynchronize
   ```
   
