idealboy commented on issue #7396: **********!!!Error with dist-sync on two 
machines, Thank you
URL: 
https://github.com/apache/incubator-mxnet/issues/7396#issuecomment-321313876
 
 
   This is the output with "export PS_VERBOSE=2" set:
   
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=server, ip=10.15.240.189, port=52099, is_recovery=0 } 
}
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=worker, ip=10.15.240.189, port=39902, is_recovery=0 } 
}
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=server, ip=10.15.133.82, port=35715, is_recovery=0 } }
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=worker, ip=10.15.133.82, port=37993, is_recovery=0 } }
   [00:38:29] src/van.cc:235: assign rank=9 to node role=worker, 
ip=10.15.240.189, port=39902, is_recovery=0
   [00:38:29] src/van.cc:235: assign rank=8 to node role=server, 
ip=10.15.240.189, port=52099, is_recovery=0
   [00:38:29] src/van.cc:235: assign rank=10 to node role=server, 
ip=10.15.133.82, port=35715, is_recovery=0
   [00:38:29] src/van.cc:235: assign rank=11 to node role=worker, 
ip=10.15.133.82, port=37993, is_recovery=0
   [00:38:29] src/van.cc:136: ? => 9. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, 
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, 
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, 
ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:136: ? => 11. Meta: request=0, timestamp=1, control={ 
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, 
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, 
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, 
ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:136: ? => 8. Meta: request=0, timestamp=2, control={ 
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, 
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, 
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, 
ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:136: ? => 10. Meta: request=0, timestamp=3, control={ 
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, 
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, 
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, 
ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:251: the scheduler is connected to 2 workers and 2 
servers
   [00:38:29] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={ 
cmd=BARRIER, barrier_group=7 }
   [00:38:29] src/van.cc:291: Barrier count for 7 : 1
   [00:38:29] src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={ 
cmd=BARRIER, barrier_group=7 }
   [00:38:29] src/van.cc:291: Barrier count for 7 : 2
   [00:38:29] src/van.cc:136: [00:38:29] src/van.cc:161: 1 => 1. Meta: 
request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
   [00:38:29] src/van.cc:291: Barrier count for 7 : 3
   ? => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 
}
   
   
   Now the program seems to be blocked somewhere, but I don't know how to 
debug this problem.
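
One way to narrow down where it blocks is to compare the node ids the scheduler assigned with the node ids that actually sent a BARRIER request. The following is only a rough sketch, assuming the log keeps the PS_VERBOSE=2 format shown above; `barrier_status` is a hypothetical helper name and is not part of MXNet or ps-lite.

```python
import re

def barrier_status(log_text):
    """Parse PS_VERBOSE=2 output (format assumed from the log above).

    Returns (assigned, arrived, missing):
      assigned: {node id: role} taken from the "assign rank=" lines
      arrived:  set of node ids seen sending a BARRIER request to node 1
      missing:  assigned nodes that never show up as a BARRIER sender
    """
    # The pasted log is hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    assigned = {
        int(rank): role
        for rank, role in re.findall(
            r"assign rank=(\d+) to node role=(\w+)", flat)
    }
    arrived = {
        int(sender)
        for sender in re.findall(
            r"(\d+) => 1\. Meta: request=1, timestamp=\d+, "
            r"control=\{ cmd=BARRIER", flat)
    }
    missing = {nid: role for nid, role in assigned.items()
               if nid not in arrived}
    return assigned, arrived, missing

# Usage (hypothetical file name):
#   assigned, arrived, missing = barrier_status(open("scheduler.log").read())
#   print(sorted(arrived), missing)
```

In the output above, BARRIER requests arrive from ids 9, 8 and 1 only; ids 10 and 11, both on 10.15.133.82, do not appear as senders.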
   
   When I run "python train_mnist.py --network lenet --gpus 0" on 
10.15.133.82 in /tmp/mxnet, training starts normally.
   
   But when I run "python train_mnist.py --network lenet --gpus 0" on 
10.15.240.189 in /tmp/mxnet, an error occurs: src/van.cc:76: Check failed: 
(my_node_.port) != (-1) bind failed
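
The "bind failed" message suggests the process could not bind a port on 10.15.240.189. As a quick check, a small script can test whether a given TCP port is bindable on that machine. This is only a sketch: `can_bind` is a hypothetical helper, not part of MXNet or ps-lite, and which port actually failed is not shown in the error, so trying 9118 (the scheduler port from the log) is an assumption.

```python
import socket

def can_bind(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use" or "permission denied".
            return False

# Usage, run on 10.15.240.189 (port number is an assumption, see above):
#   print(can_bind(9118))
```

If this returns False while no training job is supposed to be running, a leftover process from an earlier launch may still be holding the port.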
   
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
