threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401604031
 
 
   @ctcyang 
   Your command is not the correct way to launch the kvstore with type allreduce.
   It should be:
   CPU:
   mpirun -n <machine-num> -ppn 1 -machinefile ./mpd.hosts python train_imagenet.py --benchmark 1 --batch-size=64 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync --dtype float32
   
   GPU:
   mpirun -n <machine-num> -ppn 1 -machinefile ./mpd.hosts python train_imagenet.py --benchmark 1 --batch-size=512 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync_device --dtype float32 --gpus 0,1,2,3,4,5,6,7
   
   Please try the commands above. 
   
   Regarding your question about GPU reduce, I think you don't need to worry 
about the synchronization. The Reduce and Broadcast code comes directly from 
pull and push. In the pushpull implementation, I simply combined the logic of 
push and pull and replaced the communication layer, swapping the parameter 
server for MPI.  
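   To illustrate the idea (this is a hypothetical sketch, not MXNet's actual 
internals): pushpull behaves like a push that reduces gradients across workers 
followed by a pull that broadcasts the result back, which is exactly the 
semantics an MPI allreduce provides in one call. The simulation below runs the 
workers in-process for clarity; in a real deployment the sum would be an 
MPI_Allreduce across ranks.

```python
# Hypothetical sketch of pushpull = push (reduce) + pull (broadcast).
# Workers are simulated in-process; a real run would use MPI_Allreduce.

def pushpull(worker_grads):
    """Every worker ends up with the element-wise sum of all gradients."""
    # "push" phase: reduce (sum) the gradients contributed by each worker
    reduced = [sum(vals) for vals in zip(*worker_grads)]
    # "pull" phase: broadcast the reduced result back to every worker
    return [list(reduced) for _ in worker_grads]

# 3 workers, each holding a 2-element gradient
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pushpull(grads))  # every worker sees [9.0, 12.0]
```

   Because reduce and broadcast happen as one fused step, no extra 
synchronization is needed between them.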
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
