hkvision opened a new issue #17651: Distributed training with kvstore crashes if worker has different number of data batches

URL: https://github.com/apache/incubator-mxnet/issues/17651

## Description

I'm following https://mxnet.apache.org/api/faq/distributed_training to run distributed training. The example runs fine. However, when the data are not divided equally among workers, the whole program hangs and then crashes as soon as the first worker finishes training. Is this the expected behavior for data-parallel training?

In my case, each worker reads several files, so the total number of records per worker will most likely differ. How should this case be handled for distributed training with a dist kvstore? Is there any option other than splitting the whole dataset into exactly n equal parts? Thanks so much in advance.

### Error Message

The program crashes and never exits or makes progress.

## To Reproduce

```
../../tools/launch.py -n 2 -s 2 -H hosts --launcher ssh --sync-dst-dir /tmp/mxnet_job/ 'python image_classification.py --dataset dummy --model resnet18_v1 --kvstore dist_sync --log-interval 2 --batch-size 128 --epochs 3'
```

In data.py, generate a random batch each time a DummyIter instance is created.
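A hedged sketch of one possible workaround (not an official recommendation): since `dist_sync` requires every worker to push the same number of times per epoch, each worker's iterator can be wrapped in `mx.io.ResizeIter` so that all workers yield the same number of batches, repeating a few samples on the shorter shards. The random `NDArrayIter` shard and the `global_records` value below are illustrative assumptions, not part of the original report.

```python
import math
import numpy as np
import mxnet as mx

# Assumes the script is started through tools/launch.py so the dist_sync
# environment variables are set; use 'local' to try the logic on one machine.
kv = mx.kv.create('dist_sync')
batch_size = 128

# Stand-in for this worker's shard: in practice each worker reads its own
# files, so the local record count differs between workers.
num_local_records = 1000 + 128 * kv.rank  # deliberately unequal across ranks
data = np.random.uniform(size=(num_local_records, 3, 32, 32)).astype('float32')
label = np.random.randint(0, 10, size=(num_local_records,)).astype('float32')
train_iter = mx.io.NDArrayIter(data, label, batch_size=batch_size, shuffle=True)

# Agree on a common number of batches per epoch, computed the same way on
# every worker (here from an assumed, globally known record count).
global_records = 60000  # hypothetical value, known or estimated up front
batches_per_epoch = math.ceil(global_records / (batch_size * kv.num_workers))

# ResizeIter repeats (or truncates) the wrapped iterator so every worker
# yields exactly batches_per_epoch batches, keeping dist_sync pushes aligned.
train_iter = mx.io.ResizeIter(train_iter, batches_per_epoch)
```

With this wrapper the first worker to exhaust its raw data simply loops over it again instead of exiting early, so the synchronous kvstore updates stay matched across workers.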
