ctcyang edited a comment on issue #10696: [MXNET-366] Extend MXNet Distributed Training by AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401586011

I tried training with synthetic data on 8 machines x 8 GPUs (64 GPUs total) in float32 using:

`../../tools/launch.py -n 8 -H ~/host_file2 --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=512 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync_device --dtype float32 --gpus 0,1,2,3,4,5,6,7`

An error occurs after 300 minibatches: [error5.txt](https://github.com/apache/incubator-mxnet/files/2152540/error5.txt). This minibatch size is not considered large for 8 GPUs per node (64 per GPU), so the GPU memory allocation error is more likely related to this PR than to an excessive batch size. The CPU equivalent does not have this problem on 8 machines:

`../../tools/launch.py -n 8 -H ~/efs/host_file --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=64 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync --dtype float32`

Strangely, no synchronization problem occurs with the CPU version: [log5.txt](https://github.com/apache/incubator-mxnet/files/2152567/log5.txt)

One possible cause: in the log file for the 8-node experiment above, the initialization line `INFO:root:start with arguments Namespace(batch_size=512, ...)` for one node does not appear until six nodes are already at minibatch 140 (line 119), and for another node not until minibatch 420 (line 280). I understand that with MPI the log files are sometimes not perfectly synchronized, but this seems excessive, and the CPU log does not show this logging skew. For this 2-level AllReduce scheme, it raises the question of what synchronizes the GPU Reduce on Lines 192 and 208 of `/src/kvstore/kvstore_dist_sync_allreduce.h`.
I don't see a barrier here, but my intuition is that one is required to: (i) tell all nodes performing the AllReduce to wait until all local GPUs have finished (line 192), and (ii) tell the GPUs on all nodes to wait for the AllReduce result before starting execution of the GPU Broadcast (line 208). Alternatively, perhaps in the GPU case the variable is not being correctly captured by the MXNet dependency engine, so the GPUs do not wait for all results to be ready before executing `comm_->Reduce` and `comm_->Broadcast`?
----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
