zdaxie commented on issue #11777: Distributed Training: looks like async from 
the log although setting the kv-store=dist_device_sync
URL: 
https://github.com/apache/incubator-mxnet/issues/11777#issuecomment-405778326
 
 
   @nswamy @rahul003 The code I ran is 
incubator-mxnet/example/image-classification/train_imagenet.py, cloned from 
the master branch. The extra parameters are:
   --benchmark 0
   --network resnet-v1
   --gpus 0,1,2,3
   --kv-store dist_device_sync
   --num-layers 50
   --batch-size 256
   --dtype float32
   --data-train /path/to/train.rec 
   --data-val /path/to/val.rec
   --num-epochs 120
   --lr-step-epochs 30,60,90
   --lr-factor 0.1
   --image-shape 3,224,224
   --lr 0.8
   BTW, I used 8 machines with 32 GPUs in total (4 per machine).
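
For reference, the scale implied by these flags can be sketched in a few lines of Python. This is only an illustration. It assumes that `--batch-size` is counted per machine (worker) and split evenly across that machine's GPUs, and that the learning rate is multiplied by `--lr-factor` at the start of each epoch listed in `--lr-step-epochs`. The helper names `batch_sizes` and `lr_at_epoch` are hypothetical and are not part of train_imagenet.py.

```python
# Values copied from the flags above; NUM_WORKERS comes from "8 machines".
NUM_WORKERS = 8
GPUS_PER_WORKER = 4            # --gpus 0,1,2,3
BATCH_PER_WORKER = 256         # --batch-size 256 (assumed to be per worker)
BASE_LR = 0.8                  # --lr 0.8
LR_FACTOR = 0.1                # --lr-factor 0.1
LR_STEP_EPOCHS = [30, 60, 90]  # --lr-step-epochs 30,60,90


def batch_sizes():
    """Return (per-GPU, per-worker, global) batch sizes under the assumptions above."""
    per_gpu = BATCH_PER_WORKER // GPUS_PER_WORKER
    global_batch = BATCH_PER_WORKER * NUM_WORKERS
    return per_gpu, BATCH_PER_WORKER, global_batch


def lr_at_epoch(epoch):
    """Step schedule: apply LR_FACTOR once for every boundary already reached."""
    drops = sum(1 for boundary in LR_STEP_EPOCHS if epoch >= boundary)
    return BASE_LR * LR_FACTOR ** drops


if __name__ == "__main__":
    print(batch_sizes())  # (64, 256, 2048)
    for epoch in (0, 30, 60, 90):
        print(epoch, lr_at_epoch(epoch))
```

Under these assumptions each synchronous update covers 2048 images, which is the figure to compare against when judging from the log whether the workers are really running in sync.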
   
   About the last validation part: I think the main process simply quits as 
soon as it finishes validating, so the results from the other machines are 
never received. Normally there are 8 validation results after each complete 
epoch. 
