hkvision edited a comment on issue #17651: Distributed training with kvstore crashes if worker has different number of data batches
URL: https://github.com/apache/incubator-mxnet/issues/17651#issuecomment-591822816
 
 
   @leezu I upgraded to the recently released `mxnet-mkl==1.6.0` and the problem still exists.
   @apeforest I'm running the image classification example https://github.com/apache/incubator-mxnet/blob/master/example/gluon/image_classification.py on two CPU server nodes, each with the same mxnet-mkl version installed in a conda environment.
   You can reproduce the issue with the following command:
   ```
   ../../tools/launch.py -n 2 -s 2 -H hosts --launcher ssh --sync-dst-dir /tmp/mxnet_job/ '/opt/work/client/anaconda3/envs/mxnet/bin/python image_classification.py --dataset dummy --model resnet18_v1 --kvstore dist_sync --log-interval 2 --batch-size 128 --epochs 3'
   ```
   Also modify https://github.com/apache/incubator-mxnet/blob/master/example/gluon/data.py#L138 to generate a random number of batches for each DummyIter instance:
   ```
   import random
   self.batches = random.randint(10, 20)
   ```
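   For reference, here is a minimal, self-contained sketch of what a dummy iterator with a per-process random epoch length could look like. It only illustrates the kind of change described above, not the actual contents of data.py; the class name `RandomLengthDummyIter` and the data shape are made up for this sketch.
   ```
   import random

   import mxnet as mx


   class RandomLengthDummyIter(mx.io.DataIter):
       """Dummy iterator whose number of batches differs between worker processes."""

       def __init__(self, batch_size, data_shape=(3, 32, 32)):
           super(RandomLengthDummyIter, self).__init__(batch_size)
           # Each process draws its own epoch length, mimicking the change above.
           self.batches = random.randint(10, 20)
           self._cur = 0
           # Reuse a single constant batch, as a dummy iterator typically does.
           data = mx.nd.random.uniform(shape=(batch_size,) + data_shape)
           label = mx.nd.zeros((batch_size,))
           self._batch = mx.io.DataBatch(data=[data], label=[label])

       def reset(self):
           self._cur = 0

       def next(self):
           if self._cur >= self.batches:
               raise StopIteration
           self._cur += 1
           return self._batch
   ```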
   The program hangs once the worker with the smaller number of batches finishes training.
   I hope to get your help on this issue. Feel free to point out if I'm doing anything wrong. If it is truly a bug, is there any workaround for it? Thanks.
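   In case it helps the discussion, one possible workaround (an untested sketch, not something confirmed in this thread) would be to force every worker to consume the same fixed number of batches per epoch, for example by wrapping each worker's iterator in `mx.io.ResizeIter` so that the dist_sync pushes stay aligned. The stand-in iterator and the fixed size of 20 below are arbitrary choices for illustration:
   ```
   import numpy as np
   import mxnet as mx

   batch_size = 128
   # Stand-in for a worker's local data; shapes and labels are arbitrary here.
   worker_iter = mx.io.NDArrayIter(
       data=np.random.rand(batch_size * 5, 3, 32, 32).astype('float32'),
       label=np.random.randint(0, 10, batch_size * 5),
       batch_size=batch_size)

   # Force exactly 20 batches per epoch on every worker; ResizeIter restarts the
   # underlying iterator (reset_internal=True) if it runs out of data early.
   fixed_iter = mx.io.ResizeIter(worker_iter, size=20, reset_internal=True)

   for batch in fixed_iter:
       pass  # the usual forward/backward + trainer.step with kvstore='dist_sync'
   ```
   Another direction would be to have the workers agree on a common batch count before training starts, but that needs extra coordination outside the kvstore.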
