Vikas89 commented on issue #13526: distributed training van.cc Check failed
URL: https://github.com/apache/incubator-mxnet/issues/13526#issuecomment-444668438

As I see it, there are 3 different issues here:

```
  File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(LIB.MXGetLastError()))
mxnet.base.MXNetError: [08:54:25] src/van.cc:291: Check failed: (my_node.port) != (-1) bind failed
```

1) Host file: if you pass -n 2, there will be 2 workers and 2 servers. If the host file has only one line with a host and port, all of the processes will try to launch on the same port. So the workaround is the same as what I suggested earlier: please list only the host and let MXNet choose the port. If you want to choose the ports yourself, find 4 different ports that are not in use and put 4 entries in the host file. Ideally you should have multiple hosts for distributed training.

```
Traceback (most recent call last):
  File "/userhome/incubator-mxnet/tools/launch.py", line 128, in <module>
    main()
  File "/userhome/incubator-mxnet/tools/launch.py", line 109, in main
    raise RuntimeError('Unknown submission cluster type %s' % args.cluster)
RuntimeError: Unknown submission cluster type ssh
```

2) This seems like a launch script issue. Can you try omitting the --launcher option on the command line and passing the full host file path to the -H option?

```
usage: image_classification.py [-h] [--dataset DATASET] [--data-dir DATA_DIR]
                               [--num-worker NUM_WORKERS] [--batch-size BATCH_SIZE]
                               [--gpus GPUS] [--epochs EPOCHS] [--lr LR]
                               [--momentum MOMENTUM] [--wd WD] [--seed SEED]
                               [--mode MODE] --model MODEL [--use_thumbnail]
                               [--batch-norm] [--use-pretrained] [--prefix PREFIX]
                               [--start-epoch START_EPOCH] [--resume RESUME]
                               [--lr-factor LR_FACTOR] [--lr-steps LR_STEPS]
                               [--dtype DTYPE] [--save-frequency SAVE_FREQUENCY]
                               [--kvstore KVSTORE] [--log-interval LOG_INTERVAL]
                               [--profile] [--builtin-profiler BUILTIN_PROFILER]
image_classification.py: error: unrecognized arguments: epochs 1
```

3) This is a problem with the training code. If it comes from the examples, it needs to be fixed.
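For the first issue, here is a minimal Python sketch of why a single host-and-port entry fails and how to find 4 unused ports if you want to choose them yourself. It uses only the standard `socket` module and is not MXNet code. The helper names `find_free_ports` and `second_bind_fails` are made up for illustration, and how the ports are written into the host file is not shown.

```python
import socket


def find_free_ports(count, host="127.0.0.1"):
    """Ask the OS for `count` distinct ports that are currently unused on `host`."""
    sockets = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind((host, 0))  # port 0 lets the OS pick an unused port
            sockets.append(s)
        # Every socket is still open here, so no port can be handed out twice.
        return [s.getsockname()[1] for s in sockets]
    finally:
        for s in sockets:
            s.close()


def second_bind_fails(host="127.0.0.1"):
    """Return True if a second process-like bind to an already bound port is refused."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same host, same port
        except OSError:
            return True
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    # 2 workers + 2 servers need 4 different ports when they share one host.
    print("candidate ports:", find_free_ports(4))
    print("second bind refused:", second_bind_fails())
```

Note that a port reported as free can be taken by another process before the training job starts, so letting MXNet choose the port remains the simpler option.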
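For the third issue, it may help to see one way `argparse` can produce an `unrecognized arguments: epochs 1` message. This is only a possibility, not a confirmed diagnosis of the script above: the sketch builds a hypothetical, trimmed-down parser with three of the options from the usage text and shows that `epochs 1` without the leading dashes is left over as unmatched tokens. The model name `some_model` is a placeholder.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the real script's parser; only a few options.
    parser = argparse.ArgumentParser(prog="image_classification.py")
    parser.add_argument("--model", required=True)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--kvstore")
    return parser


def leftover_arguments(argv):
    """Return the tokens argparse could not match to any declared option."""
    _, extras = build_parser().parse_known_args(argv)
    return extras


if __name__ == "__main__":
    # Missing dashes: both tokens are left over, and parse_args() would
    # report them as "unrecognized arguments".
    print(leftover_arguments(["--model", "some_model", "epochs", "1"]))
    # With the dashes, nothing is left over.
    print(leftover_arguments(["--model", "some_model", "--epochs", "1"]))
```

If the command that reaches the script does contain `--epochs 1`, then the arguments are being altered somewhere between the launcher and the training script, which would need to be checked separately.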
