ctcyang edited a comment on issue #10696: [MXNET-366] Extend MXNet Distributed Training by AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401586011

I tried training with synthetic data on 8 machines x 8 GPUs (64 GPUs total) in float32 using:

`../../tools/launch.py -n 8 -H ~/host_file2 --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=512 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync_device --dtype float32 --gpus 0,1,2,3,4,5,6,7`

An error occurs after 300 minibatches: [error5.txt](https://github.com/apache/incubator-mxnet/files/2152540/error5.txt). This minibatch size is not considered large for 8 GPUs per machine (64 per GPU), so the GPU memory allocation error is more likely related to this PR than to an excessive batch size.

The equivalent CPU version does not have this problem on 8 machines:

`../../tools/launch.py -n 8 -H ~/efs/host_file --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=64 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync --dtype float32`

Strangely, no synchronization problem happens with the CPU version: [log5.txt](https://github.com/apache/incubator-mxnet/files/2152567/log5.txt)

One possible cause shows up in the log file for the 8-node experiment above: the initialization line `INFO:root:start with arguments Namespace(batch_size=512, ...)` for one node does not appear until six other nodes are already at minibatch 140, and for another node it does not appear until minibatch 420. I understand that with MPI the log files are sometimes not perfectly synchronized, but this seems excessive. For this 2-level AllReduce scheme, it raises the question of what synchronizes the GPU Reduce and Broadcast on Lines 206 and 208 of `/src/kvstore/kvstore_dist_sync_allreduce.h`.
I don't see a barrier here, but my intuition is that one is required to: (i) make the AllReduce wait until all GPUs have finished their local Reduce (line 206), and (ii) make the GPUs wait for the AllReduce result before starting the GPU Broadcast (line 208).
