tonnywang opened a new issue #9695: distribute training failure about van error
   Hi, all,
   I hit a van error when I tried to run distributed training on 2 machines 
with 3 GPUs per node.
   The error message is below.
   Traceback (most recent call last):
     File "", line 195, in <module>
     File "", line 192, in main, lr_step=args.lr_step)
     File "", line 154, in train_net
       arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch,
     File "/home/mxnet/mxnet-1.0/python/mxnet/module/", line 466, in fit
     File "/home/mxnet/mxnet-1.0/example/rcnn/rcnn/core/", line 173, in init_optimizer
     File "/home/mxnet/mxnet-1.0/python/mxnet/module/", line 499, in
       _create_kvstore(kvstore, len(self._context), self._arg_params)
     File "/home/mxnet/mxnet-1.0/python/mxnet/", line 82, in
       kv = kvs.create(kvstore)
     File "/home/mxnet/mxnet-1.0/python/mxnet/", line 655, in create
     File "/home/mxnet/mxnet-1.0/python/mxnet/", line 146, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [13:30:33] src/ Check failed: !ip.empty() failed to get ip
   What is the cause of this issue? Also, which environment variables need to 
be set for distributed training over Ethernet?
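   For reference, here is a sketch of the environment variables each process in 
an MXNet/ps-lite cluster is typically launched with. The variable names are the 
standard DMLC ones; the addresses and counts are made-up examples, and 
DMLC_INTERFACE is the setting most often used when the van cannot determine an 
IP on its own:

```shell
# Hypothetical per-process launch settings for ps-lite (example values only).
export DMLC_ROLE=worker                # scheduler | server | worker
export DMLC_PS_ROOT_URI=192.168.1.10   # scheduler's IP (example address)
export DMLC_PS_ROOT_PORT=9091          # scheduler's port (example)
export DMLC_NUM_SERVER=2               # number of server processes
export DMLC_NUM_WORKER=2               # number of worker processes
export DMLC_INTERFACE=eth0             # NIC to bind; helps when IP lookup fails
echo "role=$DMLC_ROLE interface=$DMLC_INTERFACE"
```

   Normally a launcher script sets these for you; exporting them manually like 
this is mainly useful for debugging which interface and addresses each node 
actually picks up.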

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

With regards,
Apache Git Services
