Ishitori commented on issue #13644: aws distributed training struck !
URL: https://github.com/apache/incubator-mxnet/issues/13644#issuecomment-464405780

@Davdi, did you manage to find the issue? To me it looks like a connectivity problem, because when you remove `--kvstore dist_sync` the default value is used, which for this tutorial is `device`, and that mode runs on a single machine without any network communication. I also notice that you set n = 1 but mention "2 instances". Which configuration are you trying to achieve? And what is the content of your hosts file?
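
To rule out the connectivity and configuration questions above, something like the following rough sketch might help. It is not part of MXNet: the helper names (`read_hosts`, `is_reachable`, `check_cluster`) are hypothetical, and it assumes a hosts file with one hostname or IP per line and that the instances should accept SSH connections on port 22.

```python
import socket
from pathlib import Path


def read_hosts(path):
    """Return the non-empty, non-comment lines of a hosts file (one host per line)."""
    hosts = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_cluster(hosts_path, num_workers, port=22):
    """Return a list of human-readable findings; an empty list means nothing was flagged."""
    hosts = read_hosts(hosts_path)
    findings = []
    # Assumption: one worker per listed instance is intended, so a mismatch
    # between n and the number of hosts is worth a second look.
    if len(hosts) != num_workers:
        findings.append(
            f"n = {num_workers} but the hosts file lists {len(hosts)} host(s)"
        )
    for host in hosts:
        if not is_reachable(host, port):
            findings.append(f"cannot open TCP connection to {host}:{port}")
    return findings
```

Calling `check_cluster("hosts", 2)` from the machine that starts the job would return an empty list if both instances are listed and reachable; any returned strings point at either the n / hosts mismatch or a blocked port (for example, a security group rule).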
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
