Sounds good. We (Pinar, Vandana, and I) are currently prototyping, and we
plan to start a discussion on the dev list once we reach a logical
conclusion.
We will share more details soon and seek feedback from the community.

Thanks,
Roshani

On Mon, Apr 15, 2019 at 5:30 PM Yuan Tang <terrytangy...@gmail.com> wrote:

> I am cc’ing the MXNet dev mailing list here.
>
> Thanks for the note, Roshani. Looking forward to seeing your contribution!
> To avoid duplicate work, though, let’s also discuss this on the MXNet dev
> mailing list, since other people (e.g. Carl and Lin) might be working on
> this as well.
>
> Best,
> Yuan
>
> On Mon, Apr 15, 2019 at 5:51 PM Rong Ou <rong...@gmail.com> wrote:
>
>> Sounds great! Yes it would be nice to have some examples for MXNet.
>>
>> On Mon, Apr 15, 2019 at 3:36 PM Roshani Nagmote <
>> roshaninagmo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I work on Apache MXNet, and I recently used MPI-Operator to run
>>> distributed training with MXNet and Horovod on Kubernetes.
>>> A few other folks and I tried to adjust the capacity of a training job
>>> based on the available workers and to restart the training job from where
>>> it left off if any worker goes away in between.
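>>>
>>> For context, the resume-from-checkpoint pattern we are prototyping looks
>>> roughly like the sketch below (the model, paths, and names are
>>> illustrative, not our exact code):
>>>
>>> import os
>>> import mxnet as mx
>>> import horovod.mxnet as hvd
>>> from mxnet import gluon
>>>
>>> hvd.init()                                  # one process per MPI worker
>>> ctx = mx.gpu(hvd.local_rank())
>>>
>>> net = gluon.model_zoo.vision.resnet50_v1()  # placeholder model
>>> ckpt = "/mnt/shared/checkpoint.params"      # checkpoint on a shared volume
>>>
>>> if os.path.exists(ckpt):                    # resume if a checkpoint exists
>>>     net.load_parameters(ckpt, ctx=ctx)
>>> else:
>>>     net.initialize(mx.init.Xavier(), ctx=ctx)
>>>
>>> hvd.broadcast_parameters(net.collect_params(), root_rank=0)
>>> trainer = hvd.DistributedTrainer(net.collect_params(), "sgd",
>>>                                  {"learning_rate": 0.01 * hvd.size()})
>>>
>>> # ... training loop; after each epoch, rank 0 writes the checkpoint so a
>>> # restarted job can pick up from the last completed epoch ...
>>> if hvd.rank() == 0:
>>>     net.save_parameters(ckpt)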
>>>
>>> To do this, we had to make a few modifications to MPI-Operator, for
>>> example, updating workerReplicas and launcherRole. Currently, the changes
>>> are in my repo, and I will be making a PR on MPI-Operator with them.
>>> I am also planning to contribute a few examples. I wanted to reach out to
>>> you first before creating the PR.
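>>>
>>> As a rough illustration of the kind of capacity adjustment we mean (not
>>> the operator change itself), it amounts to patching the worker replica
>>> count on a running MPIJob, e.g. with the Kubernetes Python client; the
>>> API group/version and the replica field below are assumptions and depend
>>> on the MPI-Operator version installed:
>>>
>>> from kubernetes import client, config
>>>
>>> config.load_kube_config()
>>> api = client.CustomObjectsApi()
>>>
>>> # Patch the MPIJob custom resource to match the currently available workers.
>>> api.patch_namespaced_custom_object(
>>>     group="kubeflow.org",
>>>     version="v1alpha1",              # assumption: use the installed CRD version
>>>     namespace="default",
>>>     plural="mpijobs",
>>>     name="mxnet-horovod-job",        # illustrative job name
>>>     body={"spec": {"replicas": 2}},  # assumed field; scale workers to 2
>>> )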
>>>
>>> Please let me know what your thoughts are on this.
>>>
>>> Thanks,
>>> Roshani
>>>
>>
