threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by AllReduce

URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401604031

@ctcyang Your command is not the correct way to launch a kvstore of type allreduce. It should be:

CPU:

    mpirun -n <machine-num> -ppn 1 -machinefile ./mpd.hosts python train_imagenet.py --benchmark 1 --batch-size=64 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync --dtype float32

GPU:

    mpirun -n <machine-num> -ppn 1 -machinefile ./mpd.hosts python train_imagenet.py --benchmark 1 --batch-size=512 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync_device --dtype float32 --gpus 0,1,2,3,4,5,6,7

Please try the commands above.

As for the question about GPU reduce, I don't think you need to worry about synchronization: the Reduce and Broadcast code comes directly from pull and push. In the pushpull implementation, I simply combined the logic of push and pull and replaced the communication layer, swapping the parameter server for MPI.
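To make the "pushpull = push + pull" point concrete, here is a minimal, purely illustrative sketch (plain Python, not the actual MXNet or MPI code from this PR): "push" is modeled as a reduce (summing gradients across workers, as an MPI allreduce with SUM would), "pull" as a broadcast of the reduced result back to every worker, and pushpull as the fusion of the two. All function names here are hypothetical stand-ins for the conceptual phases, not the real kvstore API.

```python
# Illustrative sketch only -- models the semantics of push/pull/pushpull,
# not MXNet's implementation or its MPI transport.

def push(worker_grads):
    # "push" phase: reduce -- element-wise sum of every worker's gradient,
    # as MPI allreduce(SUM) would produce.
    return [sum(vals) for vals in zip(*worker_grads)]

def pull(reduced, num_workers):
    # "pull" phase: broadcast -- every worker receives the same reduced result.
    return [list(reduced) for _ in range(num_workers)]

def pushpull(worker_grads):
    # Fused push + pull, mirroring how the PR combines the two code paths
    # and swaps the parameter-server transport for MPI.
    return pull(push(worker_grads), len(worker_grads))

if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 workers, 2 parameters
    print(pushpull(grads))  # every worker ends with the same summed gradient
```

Because reduce and broadcast are each deterministic collective steps, fusing them adds no extra synchronization concern beyond what push and pull already had, which is the point being made about GPU reduce above.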
