threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386539894

@rahul003 For GPU, I agree with your comment. For now we leave a placeholder for GPU as a future extension. @pengzhao-intel Patric will shed more light on it.

For resnet50: local batch size 64, global batch size 64 * 8 = 512 (8 machines). Yes, we trained entirely on CPU.

In general, allreduce performance should be similar for OpenMPI and MPICH. Intel MPI has better allreduce performance, but it is not free software, though its run-time part is free. I agree with you that we should select OpenMPI as the default MPI if no one objects (we will download the OpenMPI archive into 3rdparty and compile it).

For proto3, I tested the original kvstore type dist_sync, and it works fine with PS-Lite.
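For readers following the discussion, here is a minimal pure-Python sketch of the allreduce semantics this PR relies on (this is not MXNet or MPI code; the worker count, gradient shapes, and values are made up for illustration). Every worker contributes its local gradient, the element-wise sum is replicated to all workers, and dividing by the worker count yields the averaged gradient used for the synchronous update:

```python
# Conceptual sketch of allreduce-based gradient averaging.
# Hypothetical values for illustration; not actual MPI/MXNet code.

def allreduce_sum(grads_per_worker):
    """Element-wise sum across workers; the result is replicated on every worker,
    mirroring MPI_Allreduce with MPI_SUM."""
    n = len(grads_per_worker[0])
    summed = [sum(w[i] for w in grads_per_worker) for i in range(n)]
    return [summed[:] for _ in grads_per_worker]

# 8 workers with local batch size 64 gives the global batch size from the comment.
local_batch, num_workers = 64, 8
global_batch = local_batch * num_workers  # 64 * 8 = 512

# Each worker holds a (made-up) 3-element local gradient.
grads = [[0.1 * (rank + 1)] * 3 for rank in range(num_workers)]
reduced = allreduce_sum(grads)

# Every worker then divides by the worker count to get the averaged gradient.
avg_grad = [g / num_workers for g in reduced[0]]
```

The key property, as opposed to the parameter-server path in PS-Lite, is that after the collective every worker holds the identical averaged gradient, so no dedicated server process is needed.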
