Davdi edited a comment on issue #13526: distributed training van.cc Check failed
URL: https://github.com/apache/incubator-mxnet/issues/13526#issuecomment-444410347

> Are you running in containers?
> You said "the env is 1 ps and 1 worker and should i follow the instruction on all the 2 container ?"
> but the hosts file only has one entry; shouldn't it have 2 entries?
>
> Can you try these steps -
>
> 1. Make sure that there is no python process running on any of the hosts.
> 2. Make sure that for all the entries in the hosts file, you can ssh to those hosts from the master node (the node where you ran launch.py)
> 3. If you are using an ec2 instance, try using the private ip in the hosts file
> 4. Launch distributed training without a port: 192.168.113.223 , mxnet will automatically choose a port for workers
>
> If the error persists -
> 4) Paste the output of :
> echo $env
> 5) cat hosts
> 6) For each entry in hosts file , ssh to host and paste output of :
> ps -efl | grep python
> 7) paste launch command and Paste the entire log that you get after running launch.py

I ran the commands in a container, not on AWS.

4) echo $env
It shows nothing.

5) cat hosts
192.168.113.223:10004
This is the IP of the worker container.

6) ssh -i application_1543732416493_0027 -p 10004 root@192.168.113.223
The connection succeeds; application_1543732416493_0027 is the private key.

> root@ecbf58635533:/userhome/incubator-mxnet/example/gluon# ps -efl | grep python
0 S root 236 1 0 80 0 - 1126 wait 07:22 pts/1 00:00:00 /bin/sh -c python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 239 236 0 80 0 - 1671321 hrtime 07:22 pts/1 00:00:21 python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 402 239 0 80 0 - 7441 pipe_w 07:22 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 435 1 0 80 0 - 1126 wait 07:23 pts/1 00:00:00 /bin/sh -c python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 437 435 0 80 0 - 1671320 hrtime 07:23 pts/1 00:00:19 python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 601 437 0 80 0 - 7441 pipe_w 07:23 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 637 1 0 80 0 - 1126 wait 07:25 pts/1 00:00:00 /bin/sh -c python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 639 637 0 80 0 - 1671321 hrtime 07:25 pts/1 00:00:19 python image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 799 639 0 80 0 - 7441 pipe_w 07:25 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 996 1 0 80 0 - 1126 wait 07:31 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 999 996 0 80 0 - 1671321 hrtime 07:31 pts/1 00:00:27 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 1158 999 0 80 0 - 7441 pipe_w 07:31 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 1416 1 0 80 0 - 1126 wait 07:41 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 1419 1416 0 80 0 - 1671321 hrtime 07:41 pts/1 00:00:30 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 1582 1419 0 80 0 - 7441 pipe_w 07:41 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 17930 1 0 80 0 - 1126 wait 08:01 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 17933 17930 0 80 0 - 1671381 hrtime 08:01 pts/1 00:00:24 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18092 17933 0 80 0 - 7441 pipe_w 08:01 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 18183 1 0 80 0 - 1126 wait 08:07 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18187 18183 0 80 0 - 1671380 hrtime 08:07 pts/1 00:00:29 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18349 18187 0 80 0 - 7441 pipe_w 08:07 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 18412 1 0 80 0 - 1126 wait 08:11 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18416 18412 0 80 0 - 1671382 hrtime 08:11 pts/1 00:00:27 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18574 18416 0 80 0 - 7441 pipe_w 08:12 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 18746 1 0 80 0 - 1126 wait 08:24 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18749 18746 0 80 0 - 1671380 hrtime 08:24 pts/1 00:00:18 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18908 18749 0 80 0 - 7441 pipe_w 08:24 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 18928 1 0 80 0 - 1126 wait 08:25 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 18931 18928 0 80 0 - 1671381 hrtime 08:25 pts/1 00:00:17 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 19089 18931 0 80 0 - 7441 pipe_w 08:25 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 19228 1 0 80 0 - 1126 wait 08:26 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 19231 19228 1 80 0 - 1671381 hrtime 08:26 pts/1 00:00:23 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 19390 19231 0 80 0 - 7441 pipe_w 08:27 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 19678 1 0 80 0 - 1126 wait 08:49 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 19681 19678 5 80 0 - 1671380 hrtime 08:49 pts/1 00:00:29 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 19840 19681 0 80 0 - 7441 pipe_w 08:49 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 19900 1 0 80 0 - 1126 wait 08:53 pts/1 00:00:00 /bin/sh -c python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 19903 19900 7 80 0 - 1671380 hrtime 08:53 pts/1 00:00:26 python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
0 S root 20062 19903 0 80 0 - 7441 pipe_w 08:53 pts/1 00:00:00 /usr/local/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
0 S root 20144 169 0 80 0 - 2821 pipe_w 08:59 pts/1 00:00:00 grep --color=auto python

> `../../tools/launch.py -n 2 -H hosts --launcher ssh python image_classification.py --dataset cifar10 --model vgg11 epochs 1 --kvstore dist_sync`

The log is:
`Traceback (most recent call last):
  File "/userhome/incubator-mxnet/tools/launch.py", line 128, in <module>
    main()
  File "/userhome/incubator-mxnet/tools/launch.py", line 109, in main
    raise RuntimeError('Unknown submission cluster type %s' % args.cluster)
RuntimeError: Unknown submission cluster type ssh
root@ecbf58635533:/userhome/incubator-mxnet/example/gluon# /userhome/incubator-mxnet/tools/launch.py -n 1 -H hosts python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
[08:53:21] src/van.cc:290: Bind to role=scheduler, id=1, ip=172.17.0.4, port=9103, is_recovery=0
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Traceback (most recent call last):
  File "/userhome/incubator-mxnet/example/gluon/image_classification.py", line 23, in <module>
    import mxnet as mx
  File "/usr/local/lib/python3.6/dist-packages/mxnet/__init__.py", line 57, in <module>
    from . import kvstore_server
  File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [08:54:25] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x382d4a) [0x7f191e65bd4a]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x383381) [0x7f191e65c381]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31c1bca) [0x7f192149abca]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31cbaca) [0x7f19214a4aca]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31bcdb9) [0x7f1921495db9]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2c84b63) [0x7f1920f5db63]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x88) [0x7f1920d4d2b8]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f19becbce40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f19becbc8ab]
[bt] (9) /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2c6) [0x7f19beed09e6]
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
usage: image_classification.py [-h] [--dataset DATASET] [--data-dir DATA_DIR]
                               [--num-worker NUM_WORKERS] [--batch-size BATCH_SIZE]
                               [--gpus GPUS] [--epochs EPOCHS] [--lr LR]
                               [--momentum MOMENTUM] [--wd WD] [--seed SEED]
                               [--mode MODE] --model MODEL [--use_thumbnail]
                               [--batch-norm] [--use-pretrained] [--prefix PREFIX]
                               [--start-epoch START_EPOCH] [--resume RESUME]
                               [--lr-factor LR_FACTOR] [--lr-steps LR_STEPS]
                               [--dtype DTYPE] [--save-frequency SAVE_FREQUENCY]
                               [--kvstore KVSTORE] [--log-interval LOG_INTERVAL]
                               [--profile] [--builtin-profiler BUILTIN_PROFILER]
image_classification.py: error: unrecognized arguments: epochs 1
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dmlc_tracker/ssh.py", line 62, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 192.168.113.227 -p 10001 'export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export DMLC_PS_ROOT_URI=172.17.0.4; export DMLC_PS_ROOT_PORT=9103; export DMLC_ROLE=server; export DMLC_NODE_HOST=192.168.113.227; cd /userhome/incubator-mxnet/example/gluon/; python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync'' returned non-zero exit status 1.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dmlc_tracker/ssh.py", line 62, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 192.168.113.227 -p 10001 'export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; export DMLC_PS_ROOT_URI=172.17.0.4; export DMLC_PS_ROOT_PORT=9103; export DMLC_ROLE=worker; export DMLC_NODE_HOST=192.168.113.227; cd /userhome/incubator-mxnet/example/gluon/; python /userhome/incubator-mxnet/example/gluon/image_classification.py --model vgg11 epochs 1 --kvstore dist_sync'' returned non-zero exit status 2. `
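
A note on the worker failure in the log above: the `image_classification.py: error: unrecognized arguments: epochs 1` message, together with exit status 2, is what Python's argparse produces when an option is passed without its leading dashes (`epochs 1` rather than `--epochs 1`). The following is a minimal sketch that reproduces this behaviour. It assumes a cut-down parser with only three of the options listed in the usage text; `build_parser`, `try_parse`, and the default values are hypothetical and not taken from the actual script.

```python
import argparse


def build_parser():
    # Hypothetical, cut-down stand-in for the example script's parser.
    # Only three options from the usage text are modelled; the defaults
    # below are placeholders, not the script's real defaults.
    parser = argparse.ArgumentParser(prog="image_classification.py")
    parser.add_argument("--model", required=True)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--kvstore", default="local")
    return parser


def try_parse(argv):
    # argparse reports a usage error by printing to stderr and raising
    # SystemExit; return the exit code instead of terminating.
    try:
        return build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code


# Arguments as they appear in the launch command from the log.
as_launched = ["--model", "vgg11", "epochs", "1", "--kvstore", "dist_sync"]
# The same arguments with the leading dashes restored.
with_dashes = ["--model", "vgg11", "--epochs", "1", "--kvstore", "dist_sync"]

print("as launched:", try_parse(as_launched))
print("with dashes:", try_parse(with_dashes))
```

With `--epochs 1` the same argument list parses cleanly in this sketch. That only addresses the argument-parsing failure of the worker process; it says nothing about the separate `bind failed` check in `van.cc` reported by the server process.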
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services