Congrats on the Horovod integration, everyone. That's really great to hear.

On Wed, Jan 30, 2019 at 10:08 AM Lin Yuan <apefor...@gmail.com> wrote:
>
> Hi Yuan,
>
> Thanks for your interest. We have just supported MXNet in Horovod and are
> working on performance tuning and adding more examples. We are definitely
> interested in further extending its support with Kubeflow.
>
> Let's set up some time to have a more detailed discussion.
>
> Best,
>
> Lin
>
> On Wed, Jan 30, 2019 at 7:42 AM Yuan Tang <terrytangy...@gmail.com> wrote:
>
> > Hi,
> >
> > It's great to see MXNet-Horovod integration got merged:
> > https://github.com/uber/horovod/pull/542
> >
> > Is there any future plan for this? I've been working on Kubeflow's
> > MPI-Operator (https://github.com/kubeflow/mpi-operator) lately and it
> > would be interesting to see an example of using Horovod + MXNet +
> > Kubeflow using MPI Operator. Feel free to reach out (@terrytangyuan
> > <https://github.com/terrytangyuan>) if you encounter any issues.
> >
> > Best,
> > Yuan
> >
> >
> > On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan <apefor...@gmail.com> wrote:
> >
> > > Hi Mu,
> > >
> > > Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
> > > releasing MXNet-Horovod integration in production. We have made some
> > > changes on both the MXNet and Horovod sides. The changes on the MXNet
> > > side have mostly been merged, and we are working to merge the code into
> > > the Horovod repo. We will send a design doc to you for review again
> > > next week.
> > >
> > > Thanks for your feedback,
> > >
> > > Lin
> > >
> > > On Wed, Oct 31, 2018 at 12:03 PM Mu Li <muli....@gmail.com> wrote:
> > >
> > > > Thanks for your contribution, Carl.
> > > >
> > > > I remember I left a comment on the proposal, but today I found it had
> > > > disappeared. My suggestion is to try our best not to change the
> > > > existing API. The reason is that we would need to change all trainers
> > > > on the frontend that use the existing kvstore APIs, which may confuse
> > > > users.
> > > >
> > > > The current proposal wants to add the following 4 APIs to kvstore:
> > > >
> > > >    - kv.pushpull
> > > >    - kv.broadcast
> > > >    - kv.local_rank
> > > >    - kv.num_local_workers
> > > >
> > > > Pushpull can be done with a sequential push and pull: you can do
> > > > nothing in push and put all the work into pushpull. Broadcast can be
> > > > implemented by pull.
> > > >
> > > > What are local workers? GPUs in a single machine? If so, we can query
> > > > that directly.
> > > >
> > > >
> > > > On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <carl14...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Currently, distributed training in MXNet can only be done using a
> > > > > parameter server. Horovod is an open-source distributed training
> > > > > framework that has shown a 2x speedup compared to TensorFlow using
> > > > > Parameter Server. We propose to add Horovod support to MXNet. This
> > > > > will help our users achieve the goal of linear scalability to 256
> > > > > GPUs and beyond. Design proposal on cwiki:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> > > > >
> > > > > Please feel free to let me know if you have any suggestions or
> > > > > feedback.
> > > > >
> > > > > Regards,
> > > > > Carl
> > > > >
> > > >
> > >
> >
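
As an illustration of the point made in the thread that pushpull can be expressed as a push followed by a pull, and broadcast as a pull, here is a minimal sketch. It assumes a toy in-memory store (ToyKVStore is a hypothetical class written for this example, not MXNet's kvstore) in which push sums the values supplied by each worker and pull copies the stored value into every output buffer. The method signatures are assumptions for illustration and may not match the proposal or the final API.

```python
class ToyKVStore:
    """Hypothetical in-memory stand-in for a key-value store (not MXNet's kvstore)."""

    def __init__(self):
        self._store = {}

    def init(self, key, value):
        # Register the initial value for a key.
        self._store[key] = list(value)

    def push(self, key, values):
        # Assumed semantics: replace the stored value with the
        # element-wise sum of the values pushed by each worker.
        self._store[key] = [sum(column) for column in zip(*values)]

    def pull(self, key, outs):
        # Copy the stored value into every output buffer in place.
        for out in outs:
            out[:] = self._store[key]

    def pushpull(self, key, values, outs):
        # pushpull expressed as a sequential push and pull.
        self.push(key, values)
        self.pull(key, outs)

    def broadcast(self, key, value, outs):
        # broadcast expressed as storing one value, then pulling it everywhere.
        self.init(key, value)
        self.pull(key, outs)


kv = ToyKVStore()

# One gradient list per worker; each worker receives the aggregated result.
grads = [[1.0, 2.0], [3.0, 4.0]]
outs = [[0.0, 0.0], [0.0, 0.0]]
kv.pushpull("w", grads, outs)
print(outs)  # [[4.0, 6.0], [4.0, 6.0]]

# Every worker receives a copy of the same initial value.
copies = [[0.0, 0.0], [0.0, 0.0]]
kv.broadcast("b", [7.0, 8.0], copies)
print(copies)  # [[7.0, 8.0], [7.0, 8.0]]
```

In this sketch the combined calls add no new behavior of their own, which is the argument for keeping the existing push and pull API; a dedicated pushpull would matter only if the backend can fuse the two steps, as an allreduce does.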
