ctcyang edited a comment on issue #10696: [MXNET-366] Extend MXNet Distributed Training by AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-401586011

I tried training with synthetic data on 8 machines x 8 GPUs (64 GPUs total) in float32 using:

`../../tools/launch.py -n 8 -H ~/host_file2 --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=512 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync_device --dtype float32 --gpus 0,1,2,3,4,5,6,7`

An error occurs after 300 minibatches: [error5.txt](https://github.com/apache/incubator-mxnet/files/2152540/error5.txt). This minibatch size is not considered large for 8 GPUs per node (64 per GPU), so the GPU memory allocation error is more likely related to this PR than to an excessive batch size. The CPU equivalent does not have this problem on 8 machines:

`../../tools/launch.py -n 8 -H ~/efs/host_file --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python train_imagenet.py --benchmark 1 --batch-size=64 --model resnet --num-layers=50 --num-epochs 1 --kv-store dist_allreduce_sync --dtype float32`

Strangely, no synchronization problem occurs with the CPU version: [log5.txt](https://github.com/apache/incubator-mxnet/files/2152567/log5.txt)

One possible cause: in the log file for the 8-node experiment above, the initialization line `INFO:root:start with arguments Namespace(batch_size=512, ...)` for one node does not appear until six nodes are already at minibatch 140 (line 119), and for another node not until minibatch 420 (line 280). I understand that with MPI the log files are sometimes not perfectly synchronized, but this seems excessive, and the CPU log does not show this logging skew. For this 2-level AllReduce scheme, it raises the question of what synchronizes the GPU Reduce on Lines 192 and 208 of `/src/kvstore/kvstore_dist_sync_allreduce.h`.
I don't see a barrier here, but my intuition is that one is required to: (i) tell all nodes performing the AllReduce to wait until all local GPUs have finished (line 192), and (ii) tell the GPUs on all nodes to wait for the AllReduce result before starting execution of the GPU Broadcast (line 208). Alternatively, perhaps in the GPU case the variable is not being correctly captured by the MXNet dependency engine, so the GPUs do not wait for all results to be ready before executing `comm_->Reduce` and `comm_->Broadcast`?
----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
