threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386539894
 
 
   @rahul003 
   Regarding GPU, I agree with your comment. However, the majority of the code in 
this PR is the infrastructure for adding allreduce to MXNet, which is shared by 
both CPU and GPU; for now we leave a placeholder for the GPU path as a future 
extension. We did not run into any issue on GPU; we enabled CPU first simply 
because we currently have many CPU multi-node environments. We can discuss 
further how to add the GPU extension. @pengzhao-intel Patric will shed more 
light on it.
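   To make the shared-infrastructure point concrete, here is a minimal conceptual 
sketch of the allreduce step, written with mpi4py purely for illustration; the PR 
itself implements this in C++ on top of the MPI C API, and the variable names 
below are hypothetical, not taken from the PR.

```python
# Conceptual sketch of gradient averaging via MPI allreduce (mpi4py for brevity).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
num_workers = comm.Get_size()

# Each worker holds the gradient computed on its own local batch.
local_grad = np.random.rand(1024).astype(np.float32)

# Sum the gradients across all workers, then divide to get the average.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= num_workers

# Every worker now applies the same averaged gradient to its model replica,
# keeping all replicas in sync without a parameter server.
```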
   
   For resnet50, the local batch size is 64 and the global batch size is 64 * 8 = 
512 (8 machines). Yes, all training was done on CPU.
   
   In general, allreduce performance should be similar between OpenMPI and MPICH. 
Intel MPI has better allreduce performance, but it is not free software, although 
its runtime component is free. I agree with you that we should select OpenMPI as 
the default MPI if no one objects. (We will download the OpenMPI archive into 
3rdparty and compile it.)
   
   For proto3, I tested the original kvstore type dist_sync and it works fine with 
PS-Lite. Moreover, we only use protobuf 3.5.1; PS-Lite still uses proto2 (we just 
need to specify its version explicitly).
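   For reference, this is a minimal sketch of the compatibility check described 
above, assuming the workers and servers are launched through the usual PS-Lite 
launcher (e.g. tools/launch.py); the key name "weight" is just an example.

```python
# Sanity check that the original PS-Lite based dist_sync kvstore still works.
import mxnet as mx

kv = mx.kv.create('dist_sync')            # original parameter-server kvstore type
shape = (2, 3)
kv.init('weight', mx.nd.ones(shape))      # initialize a key on the servers

grad = mx.nd.ones(shape) * kv.rank        # each worker pushes its own gradient
kv.push('weight', grad)

out = mx.nd.zeros(shape)
kv.pull('weight', out=out)                # pull the aggregated value back
print(out.asnumpy())
```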
   
   
