[apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

Haibin Lin Tue, 12 Nov 2019 11:43:53 -0800

## Background  
Data parallel training is the most common distributed training technique when 
it comes to multiple GPUs or multiple hosts. Currently, several communication 
backends provide functionalities for communicating tensors across devices/hosts 
for data parallel training. For MXNet users, there are a few options: 
1. native kvstore
2. [p3 kvstore](https://www.sysml.cc/doc/2019/75.pdf)
3. [horovod](https://github.com/horovod/horovod/) 
4. [bytePS](https://github.com/bytedance/byteps/)


These different implementations provide different APIs:
- native kvstore
  - high level APIs: `mx.gluon.Trainer`
  - low level APIs: `kv.push`, `kv.pull`, `kv.init`
- horovod
  - high level APIs: `hvd.init()`, `hvd.DistributedTrainer`
  - low level APIs: `hvd.broadcast`, `hvd.allreduce`
- bytePS
  - high level APIs: `bps.init()`, `bps.DistributedTrainer`
  - low level APIs: `byteps_declare_tensor`, `byteps_push_pull`

Here, high level APIs refers to the API a typical novice user uses for a 
distributed training job. To communicate tensors not managed by `Trainer` or 
`DistributedTrainer`s, users may refer to the low level APIs to send/receive a 
custom tensor. 

## Problem Statement
Sometimes we want to easily switch between these different distributed 
communication backends and compare which one performs the best for a particular 
distributed training environment. Due to different APIs of these 
implementations, it requires lots of user code changes to try each one of them. 
It typically involves custom logics to:
1. launch python processes for distributed training job ([BytePS 
launch](https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#mxnet)
 v.s. [horovod](https://github.com/horovod/horovod#running-horovod)) 
2. initialize communication backends ([example 
code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L187-L228))
3. create (Distributed)Trainers ([example 
code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L297-L303))
4. send custom tensors ([example 
code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L582-L586))

## Proposal 

My proposal is to provide a unified API to allow custom communication backends 
as plugins for MXNet, so that no new user code is required to switch between 
these backends.

Specifically, communication backend provider implements the following python 
APIs.

class `AbstractKVStore`:
- def __init__(): initialization
- def broadcast(name, tensor, root_rank): broadcast the `tensor` at `root_rank` 
to all ranks
  - name: tensor name. int or str
  - tensor: ndarray 
- def push_pull(name, tensor, output): push `tensor` and pull in `output`. When 
optimizer is not set, it performs summation of `tensor` from all ranks. The 
result of the summation is then pulled back to `output` tensor. 
  - name: tensor name. int or str
  - tensor: ndarray to push
  - output: ndarray to store the output of pull 
- def set_optimizer(optimizer): set the optimizer at parameter servers. 
Optional interface, only used for parameter server based backends. 
  - optimizer: mx.optimizer.Optimizer

A communication backend provider can implement these APIs and register a new 
KVStore in MXNet via `mx.kv.register()`. For MXNet users, they only need to 
interact with the following MXNet APIs:
- using high level APIs for a typical data parallel model training
```
backend = mx.kv.create('horovod')
trainer = mx.gluon.Trainer(kv=backend)
# forward: loss = net(data)
# backward: loss.backward()
# update: trainer.step()
```
- using low level APIs to reduce a custom tensor
```
kv.broadcast("name", custom_ndarray, root_rank=0)
kv.push_pull("name", custom_ndarray, out=custom_ndarray)
```

## Limitation 

The unified interfaces do not advanced features such as sparse ndarrays or 
gradient compression, which is less mature and not provided by all 
communication backends. 

The above proposal targets use case 2,3,4 in the problem statement. It can be 
extended to tackle 1 as well if the feedbacks are positive. 

@ymjiang @apeforest @anandj91 @rich-junwang 

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/16795

[apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

Reply via email to