lzhou-arch commented on issue #4887: multi-node training hangs occasionally
URL: https://github.com/apache/incubator-mxnet/issues/4887#issuecomment-379475277
 
 
   @feiyulv Have you figured it out? I have exactly the same issue. 
   
   After I launch the job on a two-machine cluster, it hangs at:
   [14:45:23] src/van.cc:183: Barrier count for 7 : 3
   
   @szha Can you reopen this issue, since it is still unresolved? It would be
   great to have some pointers for debugging the problem.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
