threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401702058
 
 
   @eric-haibin-lin 
   1. Dist allreduce only supports mpirun, just like Horovod. I have documented 
this in the design doc. Do I need to add it elsewhere? 
   2. It's not easy to use cluster=mpi through the launcher, because the version 
of mpirun and the MPI library used in MXNet must match strictly (e.g. MPICH's 
mpirun cannot work with the Intel MPI library's barrier). The parameter server, 
by contrast, can work with many versions of mpirun, because it only uses mpirun 
to fork processes across machines.
   3. I will modify the code according to your review comments.
   4. I will add it to the Jenkins file in tests/nightly with the help of a macro.
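   For context on point 1: an allreduce combines each worker's local gradients 
so that every worker ends up with the same reduced (here, summed) result. Below 
is a minimal plain-Python sketch of that semantics only; it is an illustration, 
not MXNet's MPI-based implementation, and `allreduce_sum` is a hypothetical name.

```python
def allreduce_sum(worker_buffers):
    # Simulate the semantics of an MPI-style allreduce with a sum reduction:
    # every worker ends up with the elementwise sum of all workers' buffers.
    # (Illustration only; real allreduce runs collectively over MPI ranks.)
    total = [sum(vals) for vals in zip(*worker_buffers)]
    return [list(total) for _ in worker_buffers]

# Three workers, each holding a local gradient vector.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = allreduce_sum(grads)
# Every worker now holds the same summed vector [9.0, 12.0].
```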

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
