Sounds good. We (Pinar, Vandana, and I) are currently prototyping, and we plan to start a discussion on the dev list once we reach a logical conclusion. We will share more details soon and seek feedback from the community.
Thanks,
Roshani

On Mon, Apr 15, 2019 at 5:30 PM Yuan Tang <terrytangy...@gmail.com> wrote:

> I am cc’ing the MXNet dev mailing list here.
>
> Thanks for the note, Roshani. Looking forward to seeing your contribution!
> Let’s also discuss this on the MXNet dev mailing list, since other people
> (e.g., Carl and Lin) might be working on this as well, to avoid duplicate
> work.
>
> Best,
> Yuan
>
> On Mon, Apr 15, 2019 at 5:51 PM Rong Ou <rong...@gmail.com> wrote:
>
>> Sounds great! Yes, it would be nice to have some examples for MXNet.
>>
>> On Mon, Apr 15, 2019 at 3:36 PM Roshani Nagmote <
>> roshaninagmo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I work on Apache MXNet, and recently I used the MPI-Operator to run
>>> distributed training with MXNet and Horovod on Kubernetes.
>>> A few other folks and I tried to adjust the capacity of a training job
>>> based on the available workers, and to restart the training job from
>>> where it left off if any worker goes away in between.
>>>
>>> To do this, we had to make a few modifications to the MPI-Operator, for
>>> example, updating workerReplicas and launcherRole. Currently, the
>>> changes are in my repo, and I will open a PR on MPI-Operator with them.
>>> I am also planning to contribute a few examples. I wanted to reach out
>>> to you first before creating the PR.
>>>
>>> Please let me know what your thoughts are on this.
>>>
>>> Thanks,
>>> Roshani
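[Editor's note: for context on the job spec being discussed, below is a minimal sketch of an MPIJob manifest. It assumes the Kubeflow MPI-Operator's v1alpha2-style API with `mpiReplicaSpecs`; the job name, image, training script path, and replica counts are all placeholders, and the `workerReplicas`/`launcherRole` changes mentioned in the thread live in the operator's controller code rather than in this user-facing spec.]

```yaml
# Hypothetical example manifest; field layout follows the Kubeflow
# MPI-Operator's mpiReplicaSpecs-style API. Names, image, and command
# are placeholders, not taken from the thread above.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: mxnet-horovod-example        # placeholder job name
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1                    # a single launcher pod runs mpirun
      template:
        spec:
          containers:
          - name: mxnet-horovod
            image: example/mxnet-horovod:latest   # placeholder image
            command: ["mpirun", "python", "/examples/train.py"]
    Worker:
      replicas: 2                    # adjusting worker count to match
      template:                      # available capacity is the kind of
        spec:                        # change discussed in the thread
          containers:
          - name: mxnet-horovod
            image: example/mxnet-horovod:latest   # placeholder image
```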