irishjars opened a new issue #14774: Port binding failed in distributed training example
URL: https://github.com/apache/incubator-mxnet/issues/14774
 
 
   
   ## Description
   I'm trying to run the distributed training example in the MXNet repository 
(https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training),
 but I'm having trouble with the port binding.
   
   ## Environment info (Required)
   
   
   ----------Python Info----------
   Version      : 3.6.5
   Compiler     : GCC 7.2.0
   Build        : ('default', 'Apr 29 2018 16:14:56')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 10.0.1
   Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.3.0
   Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
   Commit Hash   : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
   ----------System Info----------
   Platform     : Linux-4.15.0-041500-generic-x86_64-with-debian-stretch-sid
   system       : Linux
   node         : ip-172-31-29-240
   release      : 4.15.0-041500-generic
   version      : #201802011154 SMP Thu Feb 1 11:55:45 UTC 2018
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                8
   On-line CPU(s) list:   0-7
   Thread(s) per core:    2
   Core(s) per socket:    4
   Socket(s):             1
   NUMA node(s):          1
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 79
   Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
   Stepping:              1
   CPU MHz:               2694.905
   CPU max MHz:           3000.0000
   CPU min MHz:           1200.0000
   BogoMIPS:              4600.12
   Hypervisor vendor:     Xen
   Virtualization type:   full
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              46080K
   NUMA node0 CPU(s):     0-7
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm 
constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq 
ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes 
xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault 
invpcid_single pti retpoline fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm 
rdseed adx xsaveopt
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0021 sec, LOAD: 0.5941 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1178 sec, LOAD: 0.5150 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.4319 sec, LOAD: 0.5457 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0546 sec, LOAD: 0.4905 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0152 sec, LOAD: 0.2675 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0211 sec, LOAD: 0.0892 sec.
   
   
   Package used (Python/R/Scala/Julia):
   I'm using Python 3.6.
   
   
   ## Build info (Required if built from source)
   
   Compiler (gcc/clang/mingw/visual studio):
   
   MXNet commit hash:
   c2ba51b742229b245367a347f2d2cc0e9c8232a2
   
   
   ## Error Message:
   Traceback (most recent call last):
     File "/home/ubuntu/605_experiment/incubator-mxnet/example/distributed_training/cifar10_dist.py", line 27, in <module>
       import mxnet as mx
     File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/__init__.py", line 57, in <module>
       from . import kvstore_server
     File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 85, in <module>
       _init_kvstore_server_module()
     File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
       server.run()
     File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 73, in run
       check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
     File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [01:41:12] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
   
   Stack trace returned 10 entries:
   [bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f23c2) [0x7f71dd54d3c2]
   [bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f2988) [0x7f71dd54d988]
   [bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38e9e7a) [0x7f71e0a44e7a]
   [bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38f3d7a) [0x7f71e0a4ed7a]
   [bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38e5069) [0x7f71e0a40069]
   [bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x324d438) [0x7f71e03a8438]
   [bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x7f) [0x7f71e018635f]
   [bt] (7) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f720ab58ec0]
   [bt] (8) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f720ab5887d]
   [bt] (9) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f720ad6de2e]
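
The error above only says that a bind failed, not which address or port was involved. A minimal standalone check, assuming plain IPv4 TCP sockets, is sketched below. The helper name `try_bind` is made up for illustration, and this is not the code ps-lite runs internally; it only reports whether the operating system lets a process bind a given address and port on the machine where it is run.

```python
import socket
from typing import Tuple


def try_bind(host: str, port: int = 0) -> Tuple[bool, str]:
    """Try to bind an IPv4 TCP socket to (host, port).

    port=0 asks the OS to pick any free port. Hypothetical helper,
    not part of MXNet or ps-lite.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        # getsockname() returns the (address, port) actually bound.
        return True, "bound to %s:%d" % sock.getsockname()
    except OSError as exc:
        # Covers "address already in use", "cannot assign requested
        # address", and name-resolution failures (socket.gaierror).
        return False, "bind failed: %s" % exc
    finally:
        sock.close()


if __name__ == "__main__":
    # Loopback should always work; replace with the address under test.
    print(try_bind("127.0.0.1"))
```

Running it on each instance with that instance's private IP shows whether the address is bindable there at all. A failure for an address that belongs to a different machine is expected.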
   
   
   
   ## Minimum reproducible example
   
   
https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training
   
   ## Steps to reproduce
   
   python ~/605_experiment/incubator-mxnet/tools/launch.py -n 2 -s 2 -H hosts \
       --sync-dst-dir /home/ubuntu/605_experiment/incubator-mxnet/example/distributed_training \
       --launcher ssh \
       "python /home/ubuntu/605_experiment/incubator-mxnet/example/distributed_training/cifar10_dist.py"
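
The launcher hands each process its role and the scheduler address through `DMLC_*` environment variables, which MXNet's distributed training documentation describes. A small sketch for dumping them is below. The helper name `dmlc_env` is hypothetical, and the variable list is an assumption about which ones are relevant here, not an exhaustive list.

```python
import os
from typing import Dict, Mapping, Optional

# Environment variables documented for MXNet distributed training.
# Which of them the launcher actually sets in this setup is not confirmed.
DMLC_VARS = (
    "DMLC_ROLE",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
    "DMLC_INTERFACE",
    "DMLC_NODE_HOST",
)


def dmlc_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Return the DMLC_* settings seen by this process (None if unset)."""
    source = os.environ if env is None else env
    return {name: source.get(name) for name in DMLC_VARS}


if __name__ == "__main__":
    for name, value in dmlc_env().items():
        print("%s=%s" % (name, "<unset>" if value is None else value))
```

Because the traceback shows the failure happening during `import mxnet`, the dump has to run before that import, for example at the very top of cifar10_dist.py, to show what each remote process received.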
   
   ## What have you tried to solve it?
   
   1. I use the private IP of each instance instead of its hostname. I have 
three instances: one host named s0 and two others named d1 and d2. My 
~/.ssh/config looks like this:
   Host s0
       HostName 172.31.29.240
       user ubuntu
       IdentityFile /home/ubuntu/ScalableML.pem
       IdentitiesOnly yes
   Host d1
       HostName 172.31.34.204
       user ubuntu
       IdentityFile /home/ubuntu/ScalableML.pem
       IdentitiesOnly yes
   
   Host d2
       HostName 172.31.33.222
       user ubuntu
       IdentityFile /home/ubuntu/ScalableML.pem
       IdentitiesOnly yes
   
   Contents of the hosts file:
   
   d1
   d2
   
   All instances can SSH to each other without being prompted for a password.
   All TCP and SSH traffic is allowed inbound and outbound.
   
   2. No Python processes are running on any instance when I run launch.py, 
which I launch from instance s0.
   
   3. The example runs when all processes are local to one instance 
(`--launcher local`) but fails with `--launcher ssh` across the three instances.
   
   4. The example uses `store = kv.create('dist')`. I also tried 
`store = kv.create('dist_async')` but I'm running into the same issue.
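
One thing that can be checked independently of MXNet: `Host` aliases defined in ~/.ssh/config are understood only by ssh, while other programs resolve names through the system resolver. Whether that matters for this failure is an assumption, not something established here. The sketch below reads a hosts file like the one above and reports which entries resolve outside of ssh. The function names `read_hosts` and `resolve_hosts` are made up for illustration.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def read_hosts(lines: Iterable[str]) -> List[str]:
    """Return the non-empty, stripped entries of a hosts file."""
    return [line.strip() for line in lines if line.strip()]


def resolve_hosts(
    hosts: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each host to its resolved IPv4 address, or None if it does not resolve.

    The resolver is injectable so the function can be exercised without DNS.
    """
    results: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


if __name__ == "__main__":
    # "hosts" is the file passed to launch.py with -H.
    with open("hosts") as handle:
        for host, address in resolve_hosts(read_hosts(handle)).items():
            print("%s -> %s" % (host, address or "does not resolve"))
```

If d1 and d2 show up as unresolved, putting the private IPs directly in the hosts file would be one way to rule this out.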
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
