I am cc’ing the MXNet dev mailing list here.
Thanks for the note, Roshani. Look forward to seeing your contribution!
Let’s also discuss this on the MXNet dev mailing list to avoid duplicate
work, since other people (e.g. Carl and Lin) might be working on this as
well.
Best,
Yuan
On Mon, Apr 15, 2019 at 5:51 PM Rong Ou wrote:
> Sounds great! Yes it would be nice to have some examples for MXNet.
>
> On Mon, Apr 15, 2019 at 3:36 PM Roshani Nagmote wrote:
>
>> Hi,
>>
>> I work on Apache MXNet and recently used MPI-Operator to run
>> distributed training with MXNet and Horovod on Kubernetes.
>> A few other folks and I tried to adjust the capacity of a training job
>> based on the available workers, and to restart the training job from
>> where it left off if any worker goes away in between.
>>
>> To do this, we had to make a few modifications to MPI-Operator, for
>> example updating workerReplicas and launcherRole. Currently, the changes
>> are in my repo, and I will be making a PR on MPI-Operator with them.
>> I am also planning to contribute a few examples. I wanted to reach out
>> to you first before creating a PR.
>>
>> Please let me know what your thoughts are on this.
>>
>> Thanks,
>> Roshani
>>
>