piiswrong commented on a change in pull request #8342: [WIP] 2bit gradient compression
URL: https://github.com/apache/incubator-mxnet/pull/8342#discussion_r146112744
##########
File path: python/mxnet/kvstore.py
##########
@@ -349,6 +349,101 @@ def row_sparse_pull(self, key, out=None, priority=0, row_ids=None):
             check_call(_LIB.MXKVStorePullRowSparse(
                 self.handle, mx_uint(len(ckeys)), ckeys, cvals, crow_ids, ctypes.c_int(priority)))
 
+    def set_compress(self, compress_params=None):
+        """ Specifies type of low-bit quantization for gradient compression if any,
+        and additional arguments depending on the type of compression being used.
+
+        Parameters
+        ----------
+        compress_params : dict
+            `compress_params` is a dictionary specifying the type and parameters
+            for gradient compression. The key `compress` in this dictionary is a required argument
+            and specifies the type of gradient compression. Other keys in this
+            dictionary are optional and specific to the type of gradient compression.
+
+        2bit Gradient Compression
+        ---------
+        2bit gradient compression takes two thresholds, one for positive values and
+        other for negative thresholds. This works by limiting positive values in the
+        gradient to the positive threshold, and limiting negative values to the
+        negative threshold. Values which don't meet the thresholds are set to 0.
+        By doing so, each value in the gradient is in one of three states. 2bits are
+        used to represent these states, and every 16 float values in the original
+        gradient can be represented using one float. This compressed representation
+        can reduce communication costs. The difference between these values and
+        original values is stored at the sender's end as residual and added to the
+        gradient in the next iteration.
+
+        When kvstore is 'local', gradient compression is used to reduce communication
+        between multiple devices (gpus). Gradient is quantized on each GPU which
+        computed the gradients, then sent to the GPU which merges the gradients. This
+        receiving GPU dequantizes the gradients and merges them. Note that this
+        increases memory usage on each GPU because of the residual array stored.
+
+        When kvstore is 'dist', gradient compression is used to reduce communication
+        from worker to sender. Gradient is quantized on each worker which
+        computed the gradients, then sent to the server which dequantizes
+        this data and merges the gradients from each worker. Note that this
+        increases CPU memory usage on each worker because of the residual array stored.
+        Only worker to server communication is compressed in this setting.
+        If each machine has multiple GPUs, currently this GPU to GPU communication is
+        not compressed. Server to worker communication (in the case of pull) is also not
+        compressed.
+
+        To use 2bit compression, we need to specify `compress` as `2bit`.
+        Only specifying `compress` would use default values
+        for the other arguments of thresholds.
+        To completely specify the arguments for 2bit compression, we would need to pass
+        a dictionary which includes `positive_threshold` and `negative_threshold` like:
+        {'compress':'2bit', 'positive_threshold':0.5, 'negative_threshold':-0.5}

Review comment:
  is it positive_threshold or pos_threshold?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
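
For readers following the docstring under review, the scheme it describes (threshold, three states, 16 values per 32-bit word, residual carried to the next iteration) can be sketched in NumPy as below. This is only an illustration under stated assumptions, not the PR's implementation: the function names, the code values 0/1/2, the inclusive comparisons, and the use of a `uint32` word in place of a reinterpreted float are all hypothetical. The default thresholds are taken from the example dictionary in the docstring.

```python
import numpy as np

# Hypothetical 2-bit codes; the encoding used in the PR may differ.
ZERO, POS, NEG = 0, 1, 2


def dequantize_2bit(codes, pos_threshold=0.5, neg_threshold=-0.5):
    """Map each code back to 0, the positive threshold, or the negative threshold."""
    out = np.zeros(codes.shape, dtype=np.float32)
    out[codes == POS] = pos_threshold
    out[codes == NEG] = neg_threshold
    return out


def quantize_2bit(grad, residual, pos_threshold=0.5, neg_threshold=-0.5):
    """Return (codes, new_residual) for one gradient array.

    The residual from the previous iteration is added before thresholding,
    and the quantization error is returned as the new residual.
    Inclusive comparisons (>=, <=) are an assumption of this sketch.
    """
    acc = grad + residual
    codes = np.full(acc.shape, ZERO, dtype=np.uint32)
    codes[acc >= pos_threshold] = POS
    codes[acc <= neg_threshold] = NEG
    new_residual = acc - dequantize_2bit(codes, pos_threshold, neg_threshold)
    return codes, new_residual


def pack_2bit(codes):
    """Pack 16 two-bit codes into each 32-bit word (zero-padded at the end)."""
    n = codes.size
    padded = np.zeros(-(-n // 16) * 16, dtype=np.uint32)  # round up to a multiple of 16
    padded[:n] = codes.ravel()
    shifts = np.arange(16, dtype=np.uint32) * 2
    return np.bitwise_or.reduce(padded.reshape(-1, 16) << shifts, axis=1)


def unpack_2bit(words, n):
    """Inverse of pack_2bit; n is the original number of codes."""
    shifts = np.arange(16, dtype=np.uint32) * 2
    codes = (words[:, None] >> shifts) & 3
    return codes.ravel()[:n]
```

A sender would call `quantize_2bit` then `pack_2bit` and keep the returned residual for the next iteration; the receiver would call `unpack_2bit` then `dequantize_2bit` before merging. The residual array has the same shape as the gradient, which is the extra memory cost the docstring mentions.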