Hi,

It's great to see that the MXNet-Horovod integration got merged:
https://github.com/uber/horovod/pull/542

Are there any future plans for this? I've been working on Kubeflow's
MPI-Operator (https://github.com/kubeflow/mpi-operator) lately, and it
would be interesting to see an example of using Horovod + MXNet + Kubeflow
with the MPI Operator. Feel free to reach out (@terrytangyuan
<https://github.com/terrytangyuan>) if you encounter any issues.

Best,
Yuan


On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan <apefor...@gmail.com> wrote:

> Hi Mu,
>
> Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
> releasing the MXNet-Horovod integration in production. We have made some
> changes on both the MXNet and Horovod sides. The changes on the MXNet side
> have mostly been merged, and we are working to merge the remaining code
> into the Horovod repo. We will send you a design doc for review again next
> week.
>
> Thanks for your feedback,
>
> Lin
>
> On Wed, Oct 31, 2018 at 12:03 PM Mu Li <muli....@gmail.com> wrote:
>
> > Thanks for your contribution, Carl.
> >
> > I remember I left a comment on the proposal, but today I found it had
> > disappeared. My suggestion is to try our best not to change the existing
> > API. The reason is that otherwise we would need to change all the
> > frontend trainers that use the existing kvstore APIs, which may confuse
> > users.
> >
> > The current proposal wants to add the following 4 APIs to kvstore:
> >
> >    - kv.pushpull
> >    - kv.broadcast
> >    - kv.local_rank
> >    - kv.num_local_workers
> >
> > Pushpull can be done with a sequential push and pull: you can do nothing
> > in push and put all the work into pull. Broadcast can be implemented
> > with pull.
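> >
> > To make this concrete, here is a rough, untested sketch (the helper
> > names pushpull and broadcast are hypothetical) of how both could be
> > composed from the existing kvstore push/pull API:
> >
> >     import mxnet as mx
> >
> >     kv = mx.kv.create('dist_sync')  # existing kvstore API, unchanged
> >
> >     def pushpull(kv, key, value, out):
> >         # Hypothetical helper: pushpull is just a push followed by a
> >         # pull on the same key, so no new kvstore API is required.
> >         kv.push(key, value)
> >         kv.pull(key, out=out)
> >
> >     def broadcast(kv, key, value, out):
> >         # Hypothetical helper: initialize the key once (with the root
> >         # worker's value) and have every worker pull the result.
> >         kv.init(key, value)
> >         kv.pull(key, out=out)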
> >
> > What are local workers? GPUs in a single machine? If so, we can query
> > that directly.
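> >
> > For example (a sketch; I believe mx.context.num_gpus() is available in
> > recent MXNet versions):
> >
> >     import mxnet as mx
> >
> >     num_local_workers = mx.context.num_gpus()  # GPUs on this machine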
> >
> >
> > On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <carl14...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Currently, MXNet distributed training can only be done using the
> > > parameter server. Horovod is an open-source distributed training
> > > framework that has shown a 2x speedup compared to TensorFlow using
> > > parameter server. We propose to add Horovod support to MXNet. This
> > > will help our users achieve the goal of linear scalability to 256 GPUs
> > > and beyond. Design proposal on cwiki:
> > >
> > > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
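> > >
> > > To give a flavor of the user-facing API, here is a rough, untested
> > > sketch; it assumes a horovod.mxnet module with names like hvd.init,
> > > hvd.local_rank, hvd.DistributedOptimizer, and
> > > hvd.broadcast_parameters, which may differ from the final design:
> > >
> > >     import mxnet as mx
> > >     import horovod.mxnet as hvd
> > >
> > >     hvd.init()                      # start Horovod (MPI under the hood)
> > >     ctx = mx.gpu(hvd.local_rank())  # one GPU per local worker process
> > >
> > >     net = mx.gluon.nn.Dense(10, in_units=20)
> > >     net.initialize(ctx=ctx)
> > >
> > >     # Wrap the optimizer so gradients are averaged across workers with
> > >     # ring-allreduce instead of going through a parameter server.
> > >     opt = hvd.DistributedOptimizer(mx.optimizer.SGD(learning_rate=0.01))
> > >     trainer = mx.gluon.Trainer(net.collect_params(), opt, kvstore=None)
> > >
> > >     # Ensure all workers start from identical parameters.
> > >     hvd.broadcast_parameters(net.collect_params(), root_rank=0)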
> > >
> > > Please feel free to let me know if you have any suggestions or
> > > feedback.
> > >
> > > Regards,
> > > Carl
> > >
> >
>
