threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401702058 @eric-haibin-lin
1. dist allreduce only supports `mpirun`, just like Horovod. I have documented this in the design doc. Do I need to add it elsewhere?
2. It's not easy to use cluster=mpi through the launcher, because the version of `mpirun` and the MPI library linked into MXNet must strictly match (e.g. MPICH's `mpirun` cannot work with the Intel MPI library's barrier). The parameter server, by contrast, can work with many versions of `mpirun`, since it only uses `mpirun` to fork processes across machines.
3. I will modify the code according to your review comments.
4. I will add it to the Jenkins file in tests/nightly with the help of a macro.
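For context on what the allreduce path computes: after an allreduce, every worker holds the same elementwise reduction (typically the sum) of all workers' gradient buffers, with no central server involved. A minimal single-process sketch of that semantics (all names here are hypothetical, not MXNet API):

```python
def allreduce_sum(worker_buffers):
    """Simulate a sum-allreduce: every worker ends up with the
    elementwise sum of all workers' buffers."""
    total = [sum(vals) for vals in zip(*worker_buffers)]
    # Each worker receives its own copy of the reduced result.
    return [list(total) for _ in worker_buffers]

# Three workers each contribute a local gradient vector.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = allreduce_sum(grads)
# Every worker now holds [9.0, 12.0]
```

In the real distributed setup this reduction is performed by the MPI library (e.g. `MPI_Allreduce`), which is why the launcher's `mpirun` and the linked MPI library must match.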
