Davdi opened a new issue #13526: distributed training  van.cc Check failed
URL: https://github.com/apache/incubator-mxnet/issues/13526
 
 
   
   ## Description
   I used this shell command to test distributed training, but it fails with the error below:
    ../../tools/launch.py -n 2 -H hosts --launcher ssh python image_classification.py --dataset cifar10 --model vgg11 epochs 1 --kvstore dist_sync
   
       _init_kvstore_server_module()
     File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
       server.run()
     File "/usr/local/lib/python3.6/dist-packages/mxnet/kvstore_server.py", line 73, in run
       check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
     File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [08:18:33] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
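
The check that fails reports that a port could not be bound on the node. One way to narrow this down is to test, on each host, whether a TCP port can be bound at all. The following is a minimal sketch under that assumption; `can_bind` is a hypothetical helper written for illustration and is not part of MXNet or ps-lite, and it does not reproduce how `van.cc` chooses its port.

```python
import socket


def can_bind(host: str = "", port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to (host, port).

    Hypothetical diagnostic helper, not an MXNet API.
    port=0 asks the operating system for any free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Address already in use, permission denied, bad interface, etc.
            return False
        return True


if __name__ == "__main__":
    # Check that the OS will hand out a free port on all interfaces.
    print("can bind a free port:", can_bind())
```

Running this on the scheduler, server, and worker hosts, and passing a specific port if one is configured, shows whether the bind failure can be reproduced outside of MXNet.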
   
   ## Environment info (Required)
   ```
   Ubuntu 16.04
   1 parameter server (ps)
   1 worker (2 GPUs)
   ```
   
   ## Steps to reproduce
   
   1. git clone incubator-mxnet and git clone dmlc-core
   2. cd dmlc-core && make
   3. cd incubator-mxnet/example/gluon/ and run:
    ../../tools/launch.py -n 2 -H hosts --launcher ssh python image_classification.py --dataset cifar10 --model vgg11 epochs 1 --kvstore dist_sync
   
   
