Davdi edited a comment on issue #13526: distributed training  van.cc Check 
failed
URL: 
https://github.com/apache/incubator-mxnet/issues/13526#issuecomment-444410347
 
 
   > Are you running in containers?
   > You said "the env is 1 ps and 1 worker and should i follow the instruction on all the 2 container ?"
   > but the hosts file only has one entry. Shouldn't it have 2 entries?
   > 
   > Can you try these steps -
   > 
   > 1. Make sure that there is no python process running on any of the hosts.
   > 2. Make sure that for all the entries in the hosts file, you can ssh to those hosts from the master node (the node where you ran launch.py).
   > 3. If you are using an ec2 instance, try using the private ip in the hosts file.
   > 4. Launch distributed training without the port: 192.168.113.223 , mxnet will automatically choose a port for workers.
   > 
   > If the error persists -
   > 4) Paste the output of :
   > echo $env
   > 5) cat hosts
   > 6) For each entry in the hosts file, ssh to the host and paste the output of :
   > ps -efl | grep python
   > 7) Paste the launch command and the entire log that you get after running launch.py
   I ran the commands in a container, not on AWS.
   4) echo $env shows nothing.
   5) cat hosts
   192.168.113.223:10004 (this is the IP of the worker container)
   6) ssh -i application_1543732416493_0027 -p 10004 root@192.168.113.223 succeeds.
   application_1543732416493_0027 is the private key file passed to ssh -i.
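
The suggestion above was to try hosts entries without a port, while the hosts file shown here still has `192.168.113.223:10004`. A minimal sketch for spotting such entries, assuming IPv4 addresses or hostnames with an optional `:port` suffix; the function names are hypothetical and not part of launch.py:

```python
def split_host_entry(entry):
    """Split one hosts-file line into (host, port); port is None when absent."""
    entry = entry.strip()
    host, sep, port = entry.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return entry, None


def entries_with_ports(lines):
    """Return (host, port) for every non-empty entry that carries a :port suffix."""
    flagged = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        host, port = split_host_entry(line)
        if port is not None:
            flagged.append((host, port))
    return flagged


if __name__ == "__main__":
    # Entry copied from the hosts file in this comment.
    print(entries_with_ports(["192.168.113.223:10004"]))
```

An empty result would mean every entry is a bare host, which is what step 4 of the quoted instructions asks to try.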
   
   > root@ecbf58635533:/userhome/incubator-mxnet/example/gluon# ps -efl | grep 
python
   0 S root        236      1  0  80   0 -  1126 wait   07:22 pts/1    00:00:00 
/bin/sh -c python image_classification.py --model vgg11 epochs 1 --kvstore 
dist_sync
   0 S root        239    236  0  80   0 - 1671321 hrtime 07:22 pts/1  00:00:21 
python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
   0 S root        402    239  0  80   0 -  7441 pipe_w 07:22 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root        435      1  0  80   0 -  1126 wait   07:23 pts/1    00:00:00 
/bin/sh -c python image_classification.py --model vgg11 epochs 1 --kvstore 
dist_sync
   0 S root        437    435  0  80   0 - 1671320 hrtime 07:23 pts/1  00:00:19 
python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
   0 S root        601    437  0  80   0 -  7441 pipe_w 07:23 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root        637      1  0  80   0 -  1126 wait   07:25 pts/1    00:00:00 
/bin/sh -c python image_classification.py --model vgg11 epochs 1 --kvstore 
dist_sync
   0 S root        639    637  0  80   0 - 1671321 hrtime 07:25 pts/1  00:00:19 
python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
   0 S root        799    639  0  80   0 -  7441 pipe_w 07:25 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root        996      1  0  80   0 -  1126 wait   07:31 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root        999    996  0  80   0 - 1671321 hrtime 07:31 pts/1  00:00:27 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root       1158    999  0  80   0 -  7441 pipe_w 07:31 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root       1416      1  0  80   0 -  1126 wait   07:41 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root       1419   1416  0  80   0 - 1671321 hrtime 07:41 pts/1  00:00:30 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root       1582   1419  0  80   0 -  7441 pipe_w 07:41 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      17930      1  0  80   0 -  1126 wait   08:01 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      17933  17930  0  80   0 - 1671381 hrtime 08:01 pts/1  00:00:24 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      18092  17933  0  80   0 -  7441 pipe_w 08:01 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      18183      1  0  80   0 -  1126 wait   08:07 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      18187  18183  0  80   0 - 1671380 hrtime 08:07 pts/1  00:00:29 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      18349  18187  0  80   0 -  7441 pipe_w 08:07 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      18412      1  0  80   0 -  1126 wait   08:11 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      18416  18412  0  80   0 - 1671382 hrtime 08:11 pts/1  00:00:27 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      18574  18416  0  80   0 -  7441 pipe_w 08:12 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      18746      1  0  80   0 -  1126 wait   08:24 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      18749  18746  0  80   0 - 1671380 hrtime 08:24 pts/1  00:00:18 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      18908  18749  0  80   0 -  7441 pipe_w 08:24 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      18928      1  0  80   0 -  1126 wait   08:25 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      18931  18928  0  80   0 - 1671381 hrtime 08:25 pts/1  00:00:17 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      19089  18931  0  80   0 -  7441 pipe_w 08:25 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      19228      1  0  80   0 -  1126 wait   08:26 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      19231  19228  1  80   0 - 1671381 hrtime 08:26 pts/1  00:00:23 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      19390  19231  0  80   0 -  7441 pipe_w 08:27 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      19678      1  0  80   0 -  1126 wait   08:49 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      19681  19678  5  80   0 - 1671380 hrtime 08:49 pts/1  00:00:29 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      19840  19681  0  80   0 -  7441 pipe_w 08:49 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      19900      1  0  80   0 -  1126 wait   08:53 pts/1    00:00:00 
/bin/sh -c python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   0 S root      19903  19900  7  80   0 - 1671380 hrtime 08:53 pts/1  00:00:26 
python /userhome/incubator-mxnet/example/gluon/image_classification.py --model 
vgg11 epochs 1 --kvstore dist_sync
   0 S root      20062  19903  0  80   0 -  7441 pipe_w 08:53 pts/1    00:00:00 
/usr/local/bin/python -c from multiprocessing.semaphore_tracker import 
main;main(3)
   0 S root      20144    169  0  80   0 -  2821 pipe_w 08:59 pts/1    00:00:00 
grep --color=auto python
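
Step 1 of the quoted instructions asks that no python process is left running, and the listing above still shows training processes started between 07:22 and 08:53. A rough sketch for pulling their PIDs out of `ps -efl` output, assuming the usual column order (F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD) and one process per line (the mail archive wrapped the lines above); `stale_pids` is a hypothetical helper:

```python
def stale_pids(ps_output, needle="image_classification.py"):
    """Return PIDs of processes whose command line contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        # The first 14 columns are fixed; the remainder is the command line.
        fields = line.split(None, 14)
        if len(fields) < 15 or not fields[3].isdigit():
            continue  # header line or malformed row
        if needle in fields[14]:
            pids.append(int(fields[3]))
    return pids


# Three rows taken from the listing above, unwrapped onto single lines.
SAMPLE = """\
0 S root 236 1 0 80 0 - 1126 wait 07:22 pts/1 00:00:00 /bin/sh -c python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 239 236 0 80 0 - 1671321 hrtime 07:22 pts/1 00:00:21 python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 20144 169 0 80 0 - 2821 pipe_w 08:59 pts/1 00:00:00 grep --color=auto python
"""

if __name__ == "__main__":
    print(stale_pids(SAMPLE))
```

The sketch only lists PIDs; whether to stop those processes before relaunching is left to whoever runs it.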
   
   > 
   
   `../../tools/launch.py -n 2 -H hosts --launcher ssh python 
image_classification.py --dataset cifar10 --model vgg11 epochs 1 --kvstore 
dist_sync`
   
   The log is:
   `Traceback (most recent call last):
     File "/userhome/incubator-mxnet/tools/launch.py", line 128, in <module>
       main()
     File "/userhome/incubator-mxnet/tools/launch.py", line 109, in main
       raise RuntimeError('Unknown submission cluster type %s' % args.cluster)
   RuntimeError: Unknown submission cluster type ssh
   root@ecbf58635533:/userhome/incubator-mxnet/example/gluon# 
/userhome/incubator-mxnet/tools/launch.py -n 1 -H hosts python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync
   
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47:
 DeprecationWarning: the imp module is deprecated in favour of importlib; see 
the module's documentation for alternative uses
     import imp
   [08:53:21] src/van.cc:290: Bind to role=scheduler, id=1, ip=172.17.0.4, 
port=9103, is_recovery=0
   
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47:
 DeprecationWarning: the imp module is deprecated in favour of importlib; see 
the module's documentation for alternative uses
     import imp
   Traceback (most recent call last):
     File "/userhome/incubator-mxnet/example/gluon/image_classification.py", 
line 23, in <module>
       import mxnet as mx
     File "/usr/local/lib/python3.6/dist-packages/mxnet/__init__.py", line 57, 
in <module>
       from . import kvstore_server
     File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", 
line 85, in <module>
       _init_kvstore_server_module()
     File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", 
line 82, in _init_kvstore_server_module
       server.run()
     File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", 
line 73, in run
       check_call(_LIB.MXKVStoreRunServer(self.handle, 
_ctrl_proto(self._controller()), None))
     File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in 
check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [08:54:25] src/van.cc:291: Check failed: 
(my_node_.port) != (-1) bind failed
   
   Stack trace returned 10 entries:
   [bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x382d4a) 
[0x7f191e65bd4a]
   [bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x383381) 
[0x7f191e65c381]
   [bt] (2) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31c1bca) 
[0x7f192149abca]
   [bt] (3) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31cbaca) 
[0x7f19214a4aca]
   [bt] (4) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31bcdb9) 
[0x7f1921495db9]
   [bt] (5) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2c84b63) 
[0x7f1920f5db63]
   [bt] (6) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x88)
 [0x7f1920d4d2b8]
   [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) 
[0x7f19becbce40]
   [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) 
[0x7f19becbc8ab]
   [bt] (9) 
/usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2c6)
 [0x7f19beed09e6]
   
   
   
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47:
 DeprecationWarning: the imp module is deprecated in favour of importlib; see 
the module's documentation for alternative uses
     import imp
   usage: image_classification.py [-h] [--dataset DATASET] [--data-dir DATA_DIR]
                                  [--num-worker NUM_WORKERS]
                                  [--batch-size BATCH_SIZE] [--gpus GPUS]
                                  [--epochs EPOCHS] [--lr LR]
                                  [--momentum MOMENTUM] [--wd WD] [--seed SEED]
                                  [--mode MODE] --model MODEL [--use_thumbnail]
                                  [--batch-norm] [--use-pretrained]
                                  [--prefix PREFIX] [--start-epoch START_EPOCH]
                                  [--resume RESUME] [--lr-factor LR_FACTOR]
                                  [--lr-steps LR_STEPS] [--dtype DTYPE]
                                  [--save-frequency SAVE_FREQUENCY]
                                  [--kvstore KVSTORE]
                                  [--log-interval LOG_INTERVAL] [--profile]
                                  [--builtin-profiler BUILTIN_PROFILER]
   image_classification.py: error: unrecognized arguments: epochs 1
   Exception in thread Thread-2:
   Traceback (most recent call last):
     File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
       self.run()
     File "/usr/lib/python3.6/threading.py", line 864, in run
       self._target(*self._args, **self._kwargs)
     File "/usr/local/lib/python3.6/dist-packages/dmlc_tracker/ssh.py", line 
62, in run
       subprocess.check_call(prog, shell = True)
     File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
       raise CalledProcessError(retcode, cmd)
   subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 
192.168.113.227 -p 10001 'export 
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64; export 
DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export 
DMLC_PS_ROOT_URI=172.17.0.4; export DMLC_PS_ROOT_PORT=9103; export 
DMLC_ROLE=server; export DMLC_NODE_HOST=192.168.113.227; cd 
/userhome/incubator-mxnet/example/gluon/; python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync'' returned non-zero exit status 1.
   
   Exception in thread Thread-3:
   Traceback (most recent call last):
     File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
       self.run()
     File "/usr/lib/python3.6/threading.py", line 864, in run
       self._target(*self._args, **self._kwargs)
     File "/usr/local/lib/python3.6/dist-packages/dmlc_tracker/ssh.py", line 
62, in run
       subprocess.check_call(prog, shell = True)
     File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
       raise CalledProcessError(retcode, cmd)
   subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 
192.168.113.227 -p 10001 'export 
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64; export 
DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export 
DMLC_PS_ROOT_URI=172.17.0.4; export DMLC_PS_ROOT_PORT=9103; export 
DMLC_ROLE=worker; export DMLC_NODE_HOST=192.168.113.227; cd 
/userhome/incubator-mxnet/example/gluon/; python 
/userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 
epochs 1 --kvstore dist_sync'' returned non-zero exit status 2.
   `
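
The worker's `error: unrecognized arguments: epochs 1` in the log above is what argparse reports when an option name is passed without its leading dashes. A small sketch reproducing that with a reduced, hypothetical parser; only three of the options from the usage message are declared, and the `None` defaults are placeholders, not the script's real defaults:

```python
import argparse


def build_parser():
    """Reduced stand-in for the script's parser (hypothetical, three options only)."""
    parser = argparse.ArgumentParser(prog="image_classification.py")
    parser.add_argument("--model", required=True)
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--kvstore", default=None)
    return parser


parser = build_parser()

# Arguments as they appear in the launch command: "epochs" has no leading dashes.
as_launched = "--model vgg11 epochs 1 --kvstore dist_sync".split()
_, leftover = parser.parse_known_args(as_launched)
print(leftover)  # the tokens argparse could not match to any option

# Same arguments with the option spelled --epochs.
with_dashes = "--model vgg11 --epochs 1 --kvstore dist_sync".split()
opts = parser.parse_args(with_dashes)
print(opts.epochs, opts.kvstore)
```

With `parse_args` instead of `parse_known_args`, the first argument list would make argparse print the usage message and exit with status 2, which is consistent with the `returned non-zero exit status 2` reported for the worker thread. This does not by itself explain the server's `bind failed` error.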
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
