hkvision opened a new issue #17651: Distributed training with kvstore crashes if workers have different numbers of data batches
URL: https://github.com/apache/incubator-mxnet/issues/17651
 
 
   ## Description
   I'm following https://mxnet.apache.org/api/faq/distributed_training to run distributed training. The examples run fine, but when the data is not divided equally among the workers, the whole program hangs and then crashes once the first worker finishes training.
   Is this the expected behavior for data-parallel training?
   In my case each worker reads several files, and the total number of records almost always differs between workers. How should this case be handled for distributed training with a dist kvstore? Is there any option other than splitting the whole dataset into exactly n equal parts?
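   One workaround I'm considering (not from the FAQ, just a sketch of the usual approach under my own assumptions) is to force every worker to yield the same number of batches per epoch by wrapping the local iterator in `mx.io.ResizeIter`. The names `local_iter` and `batches_per_epoch` below are hypothetical; `batches_per_epoch` would have to be the same value on every worker, e.g. the maximum batch count of any worker, agreed on out of band.

```python
import mxnet as mx

# Hypothetical local iterator; in practice this would read this worker's files.
local_iter = mx.io.NDArrayIter(
    data=mx.nd.random.uniform(shape=(1000, 3, 32, 32)),
    label=mx.nd.zeros((1000,)),
    batch_size=128,
)

# Assumption: batches_per_epoch is identical on every worker. ResizeIter keeps
# yielding batches (resetting the wrapped iterator internally when it is
# exhausted) until exactly `size` batches have been produced, so all workers
# hit the kvstore push/pull the same number of times per epoch.
batches_per_epoch = 100  # hypothetical value, must match across workers
train_iter = mx.io.ResizeIter(local_iter, size=batches_per_epoch,
                              reset_internal=True)

for epoch in range(3):
    train_iter.reset()
    for batch in train_iter:
        pass  # forward/backward + optimizer step, which triggers kvstore sync
```

   Is something like this the recommended approach, or is there a built-in way to handle uneven data?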
   
   Thanks so much in advance.
   
   ### Error Message
   The program hangs without producing any error output: it neither exits nor makes further progress.
   
   ## To Reproduce
   ../../tools/launch.py -n 2 -s 2 -H hosts --launcher ssh --sync-dst-dir /tmp/mxnet_job/ 'python image_classification.py --dataset dummy --model resnet18_v1 --kvstore dist_sync --log-interval 2 --batch-size 128 --epochs 3'
   In data.py, change the DummyIter so that each instance generates a random number of batches; the two workers then end up with different epoch lengths (see the sketch below).
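   For concreteness, here is a sketch of the kind of change meant above. The class name `UnevenDummyIter` and the min/max batch counts are illustrative, not the actual code in example/gluon/data.py; the point is only that each worker draws its own epoch length, so one worker finishes its epoch first and the dist_sync kvstore is left waiting.

```python
import random
import mxnet as mx

class UnevenDummyIter(mx.io.DataIter):
    """Yields one fixed random batch repeatedly, but the number of batches
    per epoch is drawn at random per instance, so two workers launched with
    the same command end up with different epoch lengths."""

    def __init__(self, batch_size, data_shape, min_batches=50, max_batches=150):
        super(UnevenDummyIter, self).__init__(batch_size)
        shape = (batch_size,) + data_shape
        self.batch = mx.io.DataBatch(
            data=[mx.nd.random.uniform(shape=shape)],
            label=[mx.nd.zeros((batch_size,))])
        self.provide_data = [('data', shape)]
        self.provide_label = [('softmax_label', (batch_size,))]
        # Different batch count on each worker -> with dist_sync kvstore the
        # program hangs once the worker with fewer batches finishes its epoch.
        self.batches = random.randint(min_batches, max_batches)
        self.cur = 0

    def reset(self):
        self.cur = 0

    def next(self):
        if self.cur >= self.batches:
            raise StopIteration
        self.cur += 1
        return self.batch
```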
