## Background Data parallel training is the most common distributed training technique when it comes to multiple GPUs or multiple hosts. Currently, several communication backends provide functionalities for communicating tensors across devices/hosts for data parallel training. For MXNet users, there are a few options: 1. native kvstore 2. [p3 kvstore](https://www.sysml.cc/doc/2019/75.pdf) 3. [horovod](https://github.com/horovod/horovod/) 4. [bytePS](https://github.com/bytedance/byteps/)
These different implementations provide different APIs: - native kvstore - high level APIs: `mx.gluon.Trainer` - low level APIs: `kv.push`, `kv.pull`, `kv.init` - horovod - high level APIs: `hvd.init()`, `hvd.DistributedTrainer` - low level APIs: `hvd.broadcast`, `hvd.allreduce` - bytePS - high level APIs: `bps.init()`, `bps.DistributedTrainer` - low level APIs: `byteps_declare_tensor`, `byteps_push_pull` Here, high level APIs refers to the API a typical novice user uses for a distributed training job. To communicate tensors not managed by `Trainer` or `DistributedTrainer`s, users may refer to the low level APIs to send/receive a custom tensor. ## Problem Statement Sometimes we want to easily switch between these different distributed communication backends and compare which one performs the best for a particular distributed training environment. Due to different APIs of these implementations, it requires lots of user code changes to try each one of them. It typically involves custom logics to: 1. launch python processes for distributed training job ([BytePS launch](https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#mxnet) v.s. [horovod](https://github.com/horovod/horovod#running-horovod)) 2. initialize communication backends ([example code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L187-L228)) 3. create (Distributed)Trainers ([example code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L297-L303)) 4. send custom tensors ([example code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L582-L586)) ## Proposal My proposal is to provide a unified API to allow custom communication backends as plugins for MXNet, so that no new user code is required to switch between these backends. Specifically, communication backend provider implements the following python APIs. class `AbstractKVStore`: - def __init__(): initialization - def broadcast(name, tensor, root_rank): broadcast the `tensor` at `root_rank` to all ranks - name: tensor name. int or str - tensor: ndarray - def push_pull(name, tensor, output): push `tensor` and pull in `output`. When optimizer is not set, it performs summation of `tensor` from all ranks. The result of the summation is then pulled back to `output` tensor. - name: tensor name. int or str - tensor: ndarray to push - output: ndarray to store the output of pull - def set_optimizer(optimizer): set the optimizer at parameter servers. Optional interface, only used for parameter server based backends. - optimizer: mx.optimizer.Optimizer A communication backend provider can implement these APIs and register a new KVStore in MXNet via `mx.kv.register()`. For MXNet users, they only need to interact with the following MXNet APIs: - using high level APIs for a typical data parallel model training ``` backend = mx.kv.create('horovod') trainer = mx.gluon.Trainer(kv=backend) # forward: loss = net(data) # backward: loss.backward() # update: trainer.step() ``` - using low level APIs to reduce a custom tensor ``` kv.broadcast("name", custom_ndarray, root_rank=0) kv.push_pull("name", custom_ndarray, out=custom_ndarray) ``` ## Limitation The unified interfaces do not advanced features such as sparse ndarrays or gradient compression, which is less mature and not provided by all communication backends. The above proposal targets use case 2,3,4 in the problem statement. It can be extended to tackle 1 as well if the feedbacks are positive. @ymjiang @apeforest @anandj91 @rich-junwang -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/apache/incubator-mxnet/issues/16795