idealboy commented on issue #7396: **********!!!Error with dist-sync on two machines, Thank you
URL: https://github.com/apache/incubator-mxnet/issues/7396#issuecomment-321313876

This is the output when "export PS_VERBOSE=2" is set:

[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.15.240.189, port=52099, is_recovery=0 } }
[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.15.240.189, port=39902, is_recovery=0 } }
[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.15.133.82, port=35715, is_recovery=0 } }
[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.15.133.82, port=37993, is_recovery=0 } }
[00:38:29] src/van.cc:235: assign rank=9 to node role=worker, ip=10.15.240.189, port=39902, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=8 to node role=server, ip=10.15.240.189, port=52099, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=10 to node role=server, ip=10.15.133.82, port=35715, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=11 to node role=worker, ip=10.15.133.82, port=37993, is_recovery=0
[00:38:29] src/van.cc:136: ? => 9. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:136: ? => 11. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:136: ? => 8. Meta: request=0, timestamp=2, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:136: ? => 10. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:251: the scheduler is connected to 2 workers and 2 servers
[00:38:29] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:291: Barrier count for 7 : 1
[00:38:29] src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:291: Barrier count for 7 : 2
[00:38:29] src/van.cc:136: [00:38:29] src/van.cc:161: 1 => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:291: Barrier count for 7 : 3 ? => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }

Now it seems the program is blocked somewhere, but I don't know how to debug this problem.

When I run "python train_mnist.py --network lenet --gpus 0" on 10.15.133.82 in /tmp/mxnet, it begins to train normally. But when I run "python train_mnist.py --network lenet --gpus 0" on 10.15.240.189 in /tmp/mxnet, an error occurs:

src/van.cc:76: Check failed: (my_node_.port) != (-1) bind failed

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
With regards, Apache Git Services
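
One way to narrow down a "bind failed" check like the one quoted above is to test whether a given TCP port can be bound on the machine that reports the error. The following is a minimal diagnostic sketch, not part of MXNet or ps-lite: the helper name `can_bind` is hypothetical, and port 9118 is only taken from the scheduler entry in the log above as an example. A `False` result suggests the port is already in use or not bindable on that address, but it does not by itself confirm the cause of the failure.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration; it only probes the local machine.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # No SO_REUSEADDR here, so a port held by another socket is reported as busy.
        sock.bind((host, port))
        return True
    except OSError:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # Example port taken from the scheduler entry in the log (assumption: same port is reused).
    print("port 9118 bindable:", can_bind("0.0.0.0", 9118))
```

Running this on 10.15.240.189 while the earlier job is still alive, and again after stopping it, would show whether a leftover process is holding the port.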