Thanks for your contribution, Carl.

I remember leaving a comment on the proposal, but today I found it had
disappeared. My suggestion is to try our best not to change the existing
API. The reason is that otherwise we would need to change every trainer on
the frontend that uses the existing kvstore APIs, which may confuse users.

The current proposal wants to add the following four APIs to kvstore:


   - kv.pushpull
   - kv.broadcast
   - kv.local_rank
   - kv.num_local_workers


Pushpull can be done with a sequential push and pull: the push can even be
a no-op on the backend, with the entire pushpull workload executed during
the pull. Broadcast can be implemented with pull. See the sketch below.
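
For concreteness, here is a minimal sketch of what I mean, using only the
existing kvstore API. The helper names pushpull/broadcast are hypothetical,
and I use the 'local' kvstore so it runs on one machine; 'dist_sync' would
be the distributed case:

    import mxnet as mx

    kv = mx.kv.create('local')  # 'dist_sync' across machines

    def pushpull(kv, key, value, out):
        # Push the local gradient, then pull the aggregated result.
        # A backend could fuse these into a single operation without
        # any frontend API change.
        kv.push(key, value)
        kv.pull(key, out=out)

    def broadcast(kv, key, value, out):
        # init sets the value on the server exactly once; every worker
        # then pulls the same value, which is effectively a broadcast.
        kv.init(key, value)
        kv.pull(key, out=out)

    kv.init('w0', mx.nd.zeros((2, 3)))  # keys must be initialized once
    grad = mx.nd.ones((2, 3))
    weight = mx.nd.empty((2, 3))
    pushpull(kv, 'w0', grad, weight)    # weight now holds the aggregated grad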

What are local workers? The GPUs in a single machine? If so, we can query
that directly.
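
For example, something like the following should suffice, assuming
mx.context.num_gpus() is available in your MXNet version:

    import mxnet as mx

    # Number of GPUs visible on this machine; no new kvstore API needed.
    num_local_gpus = mx.context.num_gpus()
    print('local GPUs:', num_local_gpus)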


On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <carl14...@gmail.com> wrote:

> Hi,
>
> Currently, distributed training in MXNet can only be done using the
> parameter server. Horovod is an open-source distributed training
> framework that has shown a 2x speedup compared to TensorFlow using the
> parameter server. We propose adding Horovod support to MXNet. This will
> help our users achieve the goal of linear scalability to 256 GPUs and
> beyond. Design proposal on cwiki:
>
> https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
>
> Please feel free to let me know if you have any suggestions or feedback.
>
> Regards,
> Carl
>
