hkvision edited a comment on issue #17651: Distributed training with kvstore crashes if worker has different number of data batches
URL: https://github.com/apache/incubator-mxnet/issues/17651#issuecomment-591822816

@leezu I upgraded to the recent `mxnet-mkl==1.6.0` release and the problem still exists.

@apeforest I'm running the image classification example https://github.com/apache/incubator-mxnet/blob/master/example/gluon/image_classification.py on two CPU server nodes, each with the same mxnet-mkl version installed in a conda environment. You can reproduce it with the following command:
```
../../tools/launch.py -n 2 -s 2 -H hosts --launcher ssh --sync-dst-dir /tmp/mxnet_job/ '/opt/work/client/anaconda3/envs/mxnet/bin/python image_classification.py --dataset dummy --model resnet18_v1 --kvstore dist_sync --log-interval 2 --batch-size 128 --epochs 3'
```
Also modify https://github.com/apache/incubator-mxnet/blob/master/example/gluon/data.py#L138 so that each DummyIter instance yields a random number of batches:
```
import random
self.batches = random.randint(10, 20)
```
The program hangs once the worker with fewer batches finishes training. If both workers have the same number of batches, the program finishes without any issue.

Hope to get your help on this issue. Feel free to point out if I'm doing anything wrong. If it is truly a bug, is there any workaround? Thanks.
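Regarding the workaround question: a common mitigation for this kind of `dist_sync` hang is to make every worker run the same number of iterations per epoch, so that all workers issue the same number of kvstore pushes/pulls. Below is a minimal sketch, not taken from the issue thread; `train_data`, `net`, `trainer`, `loss_fn`, `ctx`, and `BATCHES_PER_EPOCH` are assumed placeholders for the corresponding objects in image_classification.py, and the cap must not exceed the smallest per-worker batch count.
```python
import itertools
from mxnet import autograd

# Hedged sketch: cap every worker's epoch at the same, pre-agreed number of
# batches so all workers perform an identical number of kvstore updates.
BATCHES_PER_EPOCH = 10  # must be <= the smallest per-worker batch count

def train_one_epoch(train_data, net, trainer, loss_fn, ctx):
    train_data.reset()
    # itertools.islice stops every worker after the same number of batches,
    # so no worker blocks in dist_sync waiting for a peer that ran out of data.
    for batch in itertools.islice(iter(train_data), BATCHES_PER_EPOCH):
        data = batch.data[0].as_in_context(ctx)
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            out = net(data)
            loss = loss_fn(out, label)
        loss.backward()
        trainer.step(data.shape[0])
```
Capping the iteration count keeps the gradient synchronization counts identical across workers, which is what prevents the early-finishing worker from leaving its peers blocked in the synchronous update.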
