rahul003 commented on issue #10696: [MXNET-366] Extend MXNet Distributed Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386512681

- Implementing this only for CPU is very restrictive. Are you also planning to implement it for GPU? Are you running into any issues there? This looks helpful: https://devblogs.nvidia.com/introduction-cuda-aware-mpi/. Scaling efficiency is something we need a lot of work on, as my runs on a large number of machines are showing. It would be awesome if we can have MPI for GPU as well. I'm happy to offer any help needed to get this done.
- You mentioned scaling efficiency for resnet50, but at what batch size? Are all these training numbers on CPU?
- Leaving the choice of MPI implementation to the user is good for advanced users, but for a lot of people it adds unnecessary complexity, IMO. How do the different MPI frameworks differ? Is one framework more performant in your experience? If so, should we choose it as the default and provide a way for the user to replace it with their own framework via an environment variable or make flag? I would really recommend this option.
- @reminisce This PR changes protobuf to proto3; I wanted to draw your attention to that change. Could you review it in light of your plan to move to proto3 for mxboard, etc.? Also cc'ing @eric-haibin-lin for reviews.
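For context on what the AllReduce collective discussed above does: each worker computes local gradients, and an allreduce leaves every worker holding the sum (and hence the average) of all workers' gradients. Below is a minimal pure-Python simulation of the ring allreduce algorithm commonly used to implement this collective efficiently. This is an illustrative sketch only; the function name and setup are my own assumptions, not code from this PR, MXNet, or any MPI library.

```python
# Hypothetical single-process simulation of ring allreduce (illustrative,
# not from the PR). Each "worker" is a list of floats standing in for a
# gradient vector; the real collective would run one rank per machine.

def ring_allreduce(grads):
    """Average one gradient vector per worker via reduce-scatter + all-gather.

    grads: list of equal-length float lists, one per simulated worker.
    Returns one averaged vector per worker (all identical on return).
    """
    n = len(grads)                       # simulated world size
    size = len(grads[0])
    data = [list(g) for g in grads]      # each worker's local buffer
    # Split every vector into n contiguous chunks, one "owned" per worker.
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]

    # Phase 1 (reduce-scatter): in each step every worker sends one chunk
    # to its right neighbour, which accumulates it. After n-1 steps,
    # worker r holds the fully summed chunk (r + 1) mod n.
    for step in range(n - 1):
        snapshot = [list(d) for d in data]   # model simultaneous sends
        for r in range(n):
            c = (r - step) % n               # chunk worker r forwards now
            lo, hi = bounds[c]
            dst = (r + 1) % n
            for i in range(lo, hi):
                data[dst][i] += snapshot[r][i]

    # Phase 2 (all-gather): circulate the fully reduced chunks around the
    # ring, overwriting, until every worker has every summed chunk.
    for step in range(n - 1):
        snapshot = [list(d) for d in data]
        for r in range(n):
            c = (r + 1 - step) % n           # reduced chunk to forward
            lo, hi = bounds[c]
            dst = (r + 1) % n
            for i in range(lo, hi):
                data[dst][i] = snapshot[r][i]

    # Divide by world size to turn the sum into an average.
    return [[x / n for x in d] for d in data]


if __name__ == "__main__":
    out = ring_allreduce([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
    print(out)  # both workers end with the averaged gradient
```

The key property, and the reason ring allreduce scales well, is that each worker sends and receives only 2 * (n - 1) / n of the vector in total, independent of how large the cluster is, instead of funneling all gradients through a parameter server.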
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
With regards,
Apache Git Services
