rahul003 commented on issue #10696: [MXNET-366] Extend MXNet Distributed Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386512681

- Implementing this only for CPU is very restrictive. Are you also planning to implement it for GPU? Are you running into any issues there? This looks helpful: https://devblogs.nvidia.com/introduction-cuda-aware-mpi/. Scaling efficiency is something we need a lot of work on, as my runs on a large number of machines are showing. It would be awesome if we can have MPI for GPU as well. I'm happy to offer any help needed to get this done.
- You mentioned scaling efficiency for resnet50, but at what batch size? Are all these training numbers on CPU?
- Leaving the choice of MPI implementation to the user is good for advanced users, but for a lot of people it adds unnecessary complexity, IMO. How do the different MPI frameworks differ? Is one framework more performant in your experience? If so, should we choose it as the default and provide a way for the user to replace it with their own framework via an environment variable or make flag? I would really recommend this option.
- @reminisce This PR changes protobuf to proto3; I wanted to draw your attention to that change. Could you review it in light of your plan to move to proto3 for mxboard, etc.? Also cc'ing @eric-haibin-lin for reviews.
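For context on what the AllReduce collective discussed above does: each worker computes local gradients, and an allreduce leaves every worker holding the sum (and hence the average) of all workers' gradients. Below is a minimal pure-Python simulation of the ring allreduce algorithm commonly used to implement this collective efficiently. This is an illustrative sketch only; the function name and setup are my own assumptions, not code from this PR, MXNet, or any MPI library.

```python
# Hypothetical single-process simulation of ring allreduce (illustrative,
# not from the PR). Each "worker" is a list of floats standing in for a
# gradient vector; the real collective would run one rank per machine.

def ring_allreduce(grads):
    """Average one gradient vector per worker via reduce-scatter + all-gather.

    grads: list of equal-length float lists, one per simulated worker.
    Returns one averaged vector per worker (all identical on return).
    """
    n = len(grads)                       # simulated world size
    size = len(grads[0])
    data = [list(g) for g in grads]      # each worker's local buffer
    # Split every vector into n contiguous chunks, one "owned" per worker.
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]

    # Phase 1 (reduce-scatter): in each step every worker sends one chunk
    # to its right neighbour, which accumulates it. After n-1 steps,
    # worker r holds the fully summed chunk (r + 1) mod n.
    for step in range(n - 1):
        snapshot = [list(d) for d in data]   # model simultaneous sends
        for r in range(n):
            c = (r - step) % n               # chunk worker r forwards now
            lo, hi = bounds[c]
            dst = (r + 1) % n
            for i in range(lo, hi):
                data[dst][i] += snapshot[r][i]

    # Phase 2 (all-gather): circulate the fully reduced chunks around the
    # ring, overwriting, until every worker has every summed chunk.
    for step in range(n - 1):
        snapshot = [list(d) for d in data]
        for r in range(n):
            c = (r + 1 - step) % n           # reduced chunk to forward
            lo, hi = bounds[c]
            dst = (r + 1) % n
            for i in range(lo, hi):
                data[dst][i] = snapshot[r][i]

    # Divide by world size to turn the sum into an average.
    return [[x / n for x in d] for d in data]


if __name__ == "__main__":
    out = ring_allreduce([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
    print(out)  # both workers end with the averaged gradient
```

The key property, and the reason ring allreduce scales well, is that each worker sends and receives only 2 * (n - 1) / n of the vector in total, independent of how large the cluster is, instead of funneling all gradients through a parameter server.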
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
With regards,
Apache Git Services
