ctcyang edited a comment on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401586011
 
 
   I tried training with synthetic data on 8 machines x 8 GPUs (64 GPUs total) in float32 using: `../../tools/launch.py -n 8 -H ~/host_file2 --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=512 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync_device --dtype float32 --gpus 0,1,2,3,4,5,6,7`. An error occurs after 300 minibatches: [error5.txt](https://github.com/apache/incubator-mxnet/files/2152540/error5.txt). This minibatch size is not considered large for 8 GPUs per machine (64 examples per GPU), so the GPU memory allocation error must be related to this PR rather than to an excessive batch size.
   
   The equivalent CPU version does not have this problem on 8 machines: `../../tools/launch.py -n 8 -H ~/efs/host_file --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=64 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync --dtype float32`. Strangely, the CPU version also shows no synchronization problem: [log5.txt](https://github.com/apache/incubator-mxnet/files/2152567/log5.txt)
   
   One possible clue to this error also appears in the log file: in the 8-node experiment above, the initialization line `INFO:root:start with arguments Namespace(batch_size=512, ...)` for one node does not appear until six other nodes have already reached minibatch 140, and for another node it does not appear until minibatch 420.
   
   I understand that with MPI the log files are sometimes not perfectly synchronized, but this seems a bit excessive. For this 2-level AllReduce scheme, it raises the question of what synchronizes the GPU Reduce on lines 192 and 208 of `/src/kvstore/kvstore_dist_sync_allreduce.h`. I don't see a barrier there, but my intuition is that one is required to: (i) make all nodes performing the inter-node AllReduce wait until all of their local GPUs have finished the GPU Reduce (line 192), and (ii) make the GPUs on all nodes wait for the AllReduce result before starting the GPU Broadcast (line 208).
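   To make points (i) and (ii) concrete, here is a toy Python model of the 2-level scheme (this is not the PR's C++ code; the worker threads, gradient shards, and `threading.Barrier` placement are all illustrative assumptions). Each thread stands in for one node; the two `barrier.wait()` calls mark exactly the two synchronization points I believe are missing:

```python
# Toy 2-level AllReduce: intra-node reduce -> inter-node allreduce ->
# intra-node broadcast, with explicit barriers at the two points in question.
# Hypothetical sketch only; names and structure do not come from the PR.
import threading

NUM_NODES = 4
barrier = threading.Barrier(NUM_NODES)
lock = threading.Lock()
global_sum = [0]                 # stands in for the inter-node AllReduce buffer
results = [None] * NUM_NODES     # value each node ends up broadcasting locally

def worker(rank, local_grads):
    # Level 1: intra-node reduce (analogous to the GPU Reduce at line 192).
    local_sum = sum(local_grads)
    # (i) Barrier: the inter-node AllReduce must not read any node's buffer
    # until every node has finished its local reduce.
    barrier.wait()
    with lock:
        global_sum[0] += local_sum   # stands in for the inter-node AllReduce
    # (ii) Barrier: no node may start its local broadcast (line 208) until
    # the AllReduce result is complete on all nodes.
    barrier.wait()
    results[rank] = global_sum[0]    # stands in for the GPU Broadcast

grads = [[1, 2], [3, 4], [5, 6], [7, 8]]   # per-node gradient shards
threads = [threading.Thread(target=worker, args=(r, g))
           for r, g in enumerate(grads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every node now holds the same fully reduced value (36); removing either
# barrier lets a thread read a partially accumulated global_sum.
```

   Without the second barrier in this toy version, a fast thread can read `global_sum` before the slower threads have added their contributions, which is the kind of race I suspect between the AllReduce and the GPU Broadcast.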

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
