rahul003 edited a comment on issue #11777: Distributed Training: looks like 
async from the log although setting the kv-store=dist_device_sync
URL: 
https://github.com/apache/incubator-mxnet/issues/11777#issuecomment-405760529
 
 
   That order doesn't look normal in my experience. What network are you 
training? 
   
   Also, regarding the last validation part, does training hang at that point 
on the last epoch? That could explain why some processes go ahead and finish 
validation while others are stuck: in sync mode, a worker with more batches 
than the other worker machines ends up waiting on them. If so, can you make 
sure your code includes the changes from this PR? 
https://github.com/apache/incubator-mxnet/pull/10435/
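
   To make the uneven-batch scenario concrete, here is a minimal sketch in plain Python. It is not MXNet code and does not reflect how the kvstore is implemented. The function names and shard sizes are hypothetical, and it assumes each worker keeps its last partial batch. It only shows how unequal data shards can leave one worker with an extra batch that its peers never join.

```python
# Illustrative sketch only: plain Python, not MXNet's actual implementation.
# All names (batches_per_worker, stuck_workers) and numbers are hypothetical.

def batches_per_worker(samples_per_worker, batch_size):
    """Batches each worker runs per epoch, assuming the last partial batch is kept."""
    return [-(-n // batch_size) for n in samples_per_worker]  # ceiling division


def stuck_workers(batch_counts):
    """Ranks that still have batches left after the worker with the fewest is done.

    With synchronous updates every step waits for all workers, so a worker
    with extra batches would wait on peers that have already left the loop.
    """
    fewest = min(batch_counts)
    return [rank for rank, count in enumerate(batch_counts) if count > fewest]


if __name__ == "__main__":
    # Hypothetical shard sizes for 3 workers with batch size 32.
    counts = batches_per_worker([1000, 1000, 1040], batch_size=32)
    print(counts)                 # [32, 32, 33]
    print(stuck_workers(counts))  # [2]
```

   In this made-up example, workers 0 and 1 move on to validation after 32 batches, while worker 2 would block on its 33rd batch, which matches the symptom described above.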
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
