idealboy opened a new issue #7412: About van when using distribute training
URL: https://github.com/apache/incubator-mxnet/issues/7412
 
 
   For bugs or installation issues, please provide the following information.
   The more information you provide, the more likely people will be able to 
help you.
   
   ## Environment info
   Operating System:
   centos6.4 centos7.2
   Compiler:
   gcc-4.8.5
   Package used (Python/R/Scala/Julia):
   python
   MXNet version:
   0.9.5
   Or if installed from source:
   Yes
   MXNet commit hash (`git rev-parse HEAD`):
   
   If you are using python package, please provide
   python-2.7
   Python version and distribution:
   python-2.7
   If you are using R package, please provide
   
   R `sessionInfo()`:
   
   ## Error Message:
   
   The program keeps looping in the Start function in van.cc when mx.kvstore.create is called:
   
   while (!ready_) { LOG(INFO) << "wait ready"; }
   
   Why?
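
The wait described above can be modeled with a short Python sketch. This is a simplified illustration, not the ps-lite implementation: it assumes a receiver thread sets the ready flag once a reply from the scheduler arrives, and the names `wait_until_ready`, `receiver`, `simulate`, and the timeout value are hypothetical. Under that assumption, a reply that never arrives leaves the waiting side stuck, which would be consistent with the repeated "wait ready" output.

```python
import threading


def wait_until_ready(ready, timeout_s):
    """Block until `ready` is set; return False if the timeout expires first."""
    return ready.wait(timeout=timeout_s)


def receiver(ready, reply_arrives):
    """Stand-in for a receiving thread: sets `ready` only if a reply arrives."""
    if reply_arrives:
        ready.set()


def simulate(reply_arrives, timeout_s=0.2):
    """Run one waiter/receiver pair and report whether the waiter was released."""
    ready = threading.Event()
    worker = threading.Thread(target=receiver, args=(ready, reply_arrives))
    worker.start()
    released = wait_until_ready(ready, timeout_s)
    worker.join()
    return released


if __name__ == "__main__":
    print("reply delivered:", simulate(True))
    print("reply lost:", simulate(False))
```

Unlike the loop quoted above, the sketch uses a timeout so that the stuck case returns instead of spinning; that is a choice made for the illustration only.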
   
   
   Below is some debug output from running launch.py with train_mnist.py; some 
of the log lines were added by me in the source.
   
   
   before MXKVStoreCreate================================
   [13:36:47] src/kvstore/./kvstore_dist.h:36: KVStoreDist
   [13:36:47] src/kvstore/./kvstore_dist.h:51: is worker node7
   [13:36:47] src/kvstore/./kvstore_dist.h:53: is worker node8
   after MXKVStoreCreate================================
   [13:36:47] src/postoffice.cc:26: Postoffice Constructor
   [13:36:47] src/postoffice.cc:83: Add customer
   [13:36:47] src/postoffice.cc:61: Van Start
   [13:36:47] src/van.cc:40: is_scheduler_:0 scheduler_.hostname:10.15.240.189 
scheduler_.port:9191 scheduler_.role:2
   [13:36:47] src/van.cc:53: DMLC_INTERFACE:eth0
   [13:36:47] src/van.cc:58: DMLC_INTERFACE:10.15.240.189
   [13:36:47] src/van.cc:66: Available port:57097
   [13:36:47] src/van.cc:70: ip.empty:10.15.240.189
   [13:36:47] src/van.cc:71: port.empty:57097
   [13:36:47] src/van.cc:89: connect to scheduler
   [13:36:47] src/van.cc:172: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=server, ip=10.15.240.189, port=57097, is_recovery=0 } 
}
   [13:36:47] src/postoffice.cc:168: get dead nodes
   
   ......
   
   before MXKVStoreCreate================================
   [13:36:47] src/kvstore/./kvstore_dist.h:36: KVStoreDist
   [13:36:47] src/kvstore/./kvstore_dist.h:38: is worker node1
   [13:36:47] src/postoffice.cc:26: Postoffice Constructor
   [13:36:47] src/postoffice.cc:83: Add customer
   [13:36:47] src/kvstore/./kvstore_dist.h:40: is worker node2
   [13:36:47] src/postoffice.cc:61: Van Start
   [13:36:47] src/van.cc:40: is_scheduler_:0 scheduler_.hostname:10.15.240.189 
scheduler_.port:9191 scheduler_.role:2
   [13:36:47] src/van.cc:53: DMLC_INTERFACE:eth0
   [13:36:47] src/van.cc:58: DMLC_INTERFACE:10.15.240.189
   [13:36:47] src/van.cc:66: Available port:34210
   [13:36:47] src/van.cc:70: ip.empty:10.15.240.189
   [13:36:47] src/van.cc:71: port.empty:34210
   [13:36:47] src/van.cc:89: connect to scheduler
   [13:36:47] src/van.cc:172: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=worker, ip=10.15.240.189, port=34210, is_recovery=0 } 
}
   [13:36:47] src/postoffice.cc:168: get dead nodes
   
   
   .......
   
   
   before MXKVStoreCreate================================
   after MXKVStoreCreate================================
   [05:47:37] src/kvstore/./kvstore_dist.h:36: KVStoreDist
   [05:47:37] src/kvstore/./kvstore_dist.h:51: is worker node7
   [05:47:37] src/kvstore/./kvstore_dist.h:53: is worker node8
   [05:47:37] src/postoffice.cc:26: Postoffice Constructor
   [05:47:37] src/postoffice.cc:83: Add customer
   [05:47:37] src/postoffice.cc:61: Van Start
   [05:47:37] src/van.cc:40: is_scheduler_:0 scheduler_.hostname:10.15.240.189 
scheduler_.port:9191 scheduler_.role:2
   [05:47:37] src/van.cc:53: DMLC_INTERFACE:eth0
   [05:47:37] src/van.cc:58: DMLC_INTERFACE:10.15.133.82
   [05:47:37] src/van.cc:66: Available port:39259
   [05:47:37] src/van.cc:70: ip.empty:10.15.133.82
   [05:47:37] src/van.cc:71: port.empty:39259
   [05:47:37] src/van.cc:89: connect to scheduler
   [13:36:47] src/van.cc:172: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=server, ip=10.15.133.82, port=39259, is_recovery=0 } }
   [13:36:47] src/postoffice.cc:168: get dead nodes
   
   
   ........
   
   
   before MXKVStoreCreate================================
   [05:47:38] src/kvstore/./kvstore_dist.h:36: KVStoreDist
   [05:47:38] src/kvstore/./kvstore_dist.h:38: is worker node1
   [05:47:38] src/postoffice.cc:26: Postoffice Constructor
   [05:47:38] src/postoffice.cc:83: Add customer
   [05:47:38] src/kvstore/./kvstore_dist.h:40: is worker node2
   [05:47:38] src/postoffice.cc:61: Van Start
   [05:47:38] src/van.cc:40: is_scheduler_:0 scheduler_.hostname:10.15.240.189 
scheduler_.port:9191 scheduler_.role:2
   [05:47:38] src/van.cc:53: DMLC_INTERFACE:eth0
   [05:47:38] src/van.cc:58: DMLC_INTERFACE:10.15.133.82
   [05:47:38] src/van.cc:66: Available port:52447
   [05:47:38] src/van.cc:70: ip.empty:10.15.133.82
   [05:47:38] src/van.cc:71: port.empty:52447
   [05:47:38] src/van.cc:89: connect to scheduler
   [13:36:47] src/van.cc:172: ? => 1. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=worker, ip=10.15.133.82, port=52447, is_recovery=0 } }
   [13:36:47] src/postoffice.cc:168: get dead nodes
   [13:36:47] src/van.cc:246: assign rank=9 to node role=worker, 
ip=10.15.240.189, port=34210, is_recovery=0
   [13:36:47] src/van.cc:246: assign rank=8 to node role=server, 
ip=10.15.240.189, port=57097, is_recovery=0
   [13:36:47] src/van.cc:246: assign rank=10 to node role=server, 
ip=10.15.133.82, port=39259, is_recovery=0
   [13:36:47] src/van.cc:246: assign rank=11 to node role=worker, 
ip=10.15.133.82, port=52447, is_recovery=0
   [13:36:47] src/van.cc:147: ? => 9. Meta: request=0, timestamp=0, control={ 
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=34210, 
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=57097, is_recovery=0 
role=server, id=10, ip=10.15.133.82, port=39259, is_recovery=0 role=worker, 
id=11, ip=10.15.133.82, port=52447, is_recovery=0 role=scheduler, id=1, 
ip=10.15.240.189, port=9191, is_recovery=0 } }
   [13:36:47] src/van.cc:147: ? => 11. Meta: request=0, timestamp=1, control={ 
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=34210, 
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=57097, is_recovery=0 
role=server, id=10, ip=10.15.133.82, port=39259, is_recovery=0 role=worker, 
id=11, ip=10.15.133.82, port=52447, is_recovery=0 role=scheduler, id=1, 
ip=10.15.240.189, port=9191, is_recovery=0 } }
   [13:36:47] src/van.cc:147: ? => 8. Meta: request=0, timestamp=2, control={ 
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=34210, 
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=57097, is_recovery=0 
role=server, id=10, ip=10.15.133.82, port=39259, is_recovery=0 role=worker, 
id=11, ip=10.15.133.82, port=52447, is_recovery=0 role=scheduler, id=1, 
ip=10.15.240.189, port=9191, is_recovery=0 } }
   [13:36:47] src/van.cc:147[13:36:47] src/postoffice.cc:168: get dead nodes
   : ? => 10. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ 
role=worker, id=9, ip=10.15.240.189, port=34210, is_recovery=0 role=server, 
id=8, ip=10.15.240.189, port=57097, is_recovery=0 role=server, id=10, 
ip=10.15.133.82, port=39259, is_recovery=0 role=worker, id=11, ip=10.15.133.82, 
port=52447, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9191, 
is_recovery=0 } }[13:36:47] src/postoffice.cc:168: get dead nodes
   
   [13:36:47] src/van.cc:262: the scheduler is connected to 2 workers and 2 
servers
   [13:36:47] src/postoffice.cc:72: Van Start Do Barrier
   [13:36:47] src/kvstore/./kvstore_dist.h:42: is worker node3
   [13:36:47] src/kvstore/./kvstore_dist.h:44: is worker node4
   [13:36:47] src/postoffice.cc:117: Barrier role:1
   [13:36:47] src/postoffice.cc:136: Barrier van send
   [13:36:47] src/postoffice.cc:72: Van Start Do Barrier
   [13:36:47] src/postoffice.cc:117: Barrier role:0
   [13:36:47] src/postoffice.cc:136: Barrier van send
   [13:36:47] src/postoffice.cc:72: Van Start Do Barrier
   [13:36:47] src/postoffice.cc:117: Barrier role:2
   [13:36:47] src/van.cc:147: ? => 1. Meta: request=1, timestamp=4, control={ 
cmd=BARRIER, barrier_group=7 }
   [13:36:47] src/postoffice.cc:136: Barrier van send
   [13:36:47] src/van.cc:172: 1 => 1. Meta: request=1, timestamp=4, control={ 
cmd=BARRIER, barrier_group=7 }
   [13:36:47] src/van.cc:302: Barrier count for 7 : 1
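
The node IDs and the barrier group in the log above can be cross-checked with a small helper. This is a sketch under assumptions about ps-lite's conventions that the log appears to match but that are not confirmed here: group bits scheduler=1, server group=2, worker group=4; server rank r maps to ID 2r+8 and worker rank r to ID 2r+9. All function names are hypothetical. Under those assumptions, barrier_group=7 covers the scheduler, both servers, and both workers, so five barrier requests would be expected, while the last log line reports a count of 1.

```python
# Assumed group bit values (not taken from the log itself).
SCHEDULER, SERVER_GROUP, WORKER_GROUP = 1, 2, 4


def decode_group(group):
    """Return the role names whose bits are set in a barrier group value."""
    names = []
    if group & SCHEDULER:
        names.append("scheduler")
    if group & SERVER_GROUP:
        names.append("server")
    if group & WORKER_GROUP:
        names.append("worker")
    return names


def server_id(rank):
    """Assumed mapping from server rank to node ID."""
    return rank * 2 + 8


def worker_id(rank):
    """Assumed mapping from worker rank to node ID."""
    return rank * 2 + 9


def expected_barrier_count(group, num_workers, num_servers):
    """Number of nodes that would have to reach the barrier for this group."""
    count = 0
    if group & SCHEDULER:
        count += 1
    if group & SERVER_GROUP:
        count += num_servers
    if group & WORKER_GROUP:
        count += num_workers
    return count


if __name__ == "__main__":
    print(decode_group(7))
    print([server_id(r) for r in range(2)], [worker_id(r) for r in range(2)])
    print(expected_barrier_count(7, num_workers=2, num_servers=2))
```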
   
   
   
   ## Minimum reproducible example
   if you are using your own code, please provide a short script that 
reproduces the error.
   
   ## Steps to reproduce
   or if you are running standard examples, please provide the commands you 
have run that lead to the error.
   
   1.
   2.
   3.
   
   ## What have you tried to solve it?
   
   1.
   2.
   3.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
