tonnywang opened a new issue #9695: distribute training failure about van error
URL: https://github.com/apache/incubator-mxnet/issues/9695
 
 
   Hi all,
   I hit a van error when I tried to run distributed training on 2 machines 
with 3 GPUs per node.
   The error message is below.
   Traceback (most recent call last):
     File "train_end2end.py", line 195, in <module>
       main()
     File "train_end2end.py", line 192, in main
       lr=args.lr, lr_step=args.lr_step)
     File "train_end2end.py", line 154, in train_net
       arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
     File "/home/mxnet/mxnet-1.0/python/mxnet/module/base_module.py", line 466, in fit
       optimizer_params=optimizer_params)
     File "/home/mxnet/mxnet-1.0/example/rcnn/rcnn/core/module.py", line 173, in init_optimizer
       force_init=force_init)
     File "/home/mxnet/mxnet-1.0/python/mxnet/module/module.py", line 499, in init_optimizer
       _create_kvstore(kvstore, len(self._context), self._arg_params)
     File "/home/mxnet/mxnet-1.0/python/mxnet/model.py", line 82, in _create_kvstore
       kv = kvs.create(kvstore)
     File "/home/mxnet/mxnet-1.0/python/mxnet/kvstore.py", line 655, in create
       ctypes.byref(handle)))
     File "/home/mxnet/mxnet-1.0/python/mxnet/base.py", line 146, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [13:30:33] src/van.cc:63: Check failed: !ip.empty() failed to get ip
   
   What causes this error? Also, which environment variables need to be set 
for distributed training over Ethernet?
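
   For anyone trying to reproduce this, a minimal diagnostic sketch in Python 
follows. It assumes the check at src/van.cc:63 fires because the parameter-server 
layer (ps-lite) could not determine a local IP address, and that the DMLC_* 
names listed are the variables it reads. Both points are assumptions and are not 
confirmed by this report. The helper names dmlc_env and outbound_ipv4 are 
hypothetical and exist only for this illustration.

```python
import os
import socket

# Variable names assumed from ps-lite conventions; not confirmed by this report.
DMLC_VARS = [
    "DMLC_ROLE",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
    "DMLC_INTERFACE",
    "DMLC_NODE_HOST",
]


def dmlc_env(environ=None):
    """Map each assumed DMLC variable to its value, or None when it is unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in DMLC_VARS}


def outbound_ipv4(probe_host="10.255.255.255", probe_port=1):
    """Best-effort guess at the local IPv4 address used for outbound traffic.

    Connecting a UDP socket only selects a route; no packet is sent.
    Returns None when the host has no usable route.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        return None
    finally:
        sock.close()


if __name__ == "__main__":
    for name, value in dmlc_env().items():
        print(f"{name}={value if value is not None else '<unset>'}")
    print("outbound IPv4:", outbound_ipv4() or "<none found>")
```

   Running this on each node shows which of the assumed variables are unset and 
whether the host resolves a routable address at all. It does not reproduce the 
interface lookup that the library itself performs.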
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
