irishjars opened a new issue #14774: Port binding failed in distributed training example
URL: https://github.com/apache/incubator-mxnet/issues/14774

## Description
I'm trying to run the distributed training example in the MXNet repository (https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training), but I'm having trouble with the port binding.

## Environment info (Required)

```
What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.
```

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash  : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
----------System Info----------
Platform     : Linux-4.15.0-041500-generic-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-29-240
release      : 4.15.0-041500-generic
version      : #201802011154 SMP Thu Feb 1 11:55:45 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2694.905
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.12
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti retpoline fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0021 sec, LOAD: 0.5941 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1178 sec, LOAD: 0.5150 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.4319 sec, LOAD: 0.5457 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0546 sec, LOAD: 0.4905 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0152 sec, LOAD: 0.2675 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0211 sec, LOAD: 0.0892 sec.

Package used (Python/R/Scala/Julia):
I'm using Python 3.6.

## Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash: c2ba51b742229b245367a347f2d2cc0e9c8232a2

Build config:
(Paste the content of config.mk, or the build command.)

## Error Message:
Traceback (most recent call last):
  File "/home/ubuntu/605_experiment/incubator-mxnet/example/distributed_training/cifar10_dist.py", line 27, in <module>
    import mxnet as mx
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/__init__.py", line 57, in <module>
    from . import kvstore_server
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [01:41:12] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f23c2) [0x7f71dd54d3c2]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f2988) [0x7f71dd54d988]
[bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38e9e7a) [0x7f71e0a44e7a]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38f3d7a) [0x7f71e0a4ed7a]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38e5069) [0x7f71e0a40069]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x324d438) [0x7f71e03a8438]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x7f) [0x7f71e018635f]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f720ab58ec0]
[bt] (8) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f720ab5887d]
[bt] (9) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f720ad6de2e]

## Minimum reproducible example
https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training

## Steps to reproduce
python ~/605_experiment/incubator-mxnet/tools/launch.py -n 2 -s 2 -H hosts --sync-dst-dir /home/ubuntu/605_experiment/incubator-mxnet/example/distributed_training --launcher ssh "python /home/ubuntu/605_experiment/incubator-mxnet/example/distributed_training/cifar10_dist.py"

## What have you tried to solve it?

1. Using the private IP of each instance instead of its hostname. I have three instances: one host named s0, and two others named d1 and d2. My ~/.ssh/config looks like:

    Host s0
      HostName 172.31.29.240
      user ubuntu
      IdentityFile /home/ubuntu/ScalableML.pem
      IdentitiesOnly yes
    Host d1
      HostName 172.31.34.204
      user ubuntu
      IdentityFile /home/ubuntu/ScalableML.pem
      IdentitiesOnly yes
    Host d2
      HostName 172.31.33.222
      user ubuntu
      IdentityFile /home/ubuntu/ScalableML.pem
      IdentitiesOnly yes

   Content of the hosts file:

    d1
    d2

   All instances can ssh to each other without requiring authentication. All TCP and SSH traffic is allowed inbound and outbound.
2. No Python processes are running on any instance at the time of running launch.py. I'm running launch.py on instance s0.
3. The example runs when all processes are local to one instance (the `local` launcher option) but fails with the `ssh` launcher across three instances.
4. The example has `store = kv.create('dist')`. I tried `store = kv.create('dist_async')`, but I'm running into the same issue.
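
One way to narrow down a `bind failed` error like the one above is to check, independently of MXNet, whether a process on the affected instance can bind a TCP socket to a given address and port. The sketch below is only a diagnostic aid built on the Python standard library; the helper name `can_bind` is hypothetical, and it assumes the failure comes from the operating system refusing the bind (for example, the port is already in use or the address does not belong to a local interface), which the traceback alone does not confirm.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) on this machine.

    Hypothetical helper for diagnosis only; it is not part of MXNet.
    Passing port 0 asks the operating system to pick any free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typical causes: port already in use, address not assigned
            # to any local interface, or insufficient privileges.
            return False
        return True
```

For instance, running `can_bind("172.31.29.240", 0)` on s0 would show whether that private IP can be bound at all on that machine, and calling it with a specific port number (whichever port the launcher is expected to use, which is not stated in this report) would show whether that port is free.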
