cyb70289 commented on pull request #12442:
URL: https://github.com/apache/arrow/pull/12442#issuecomment-1085377998


   The "unknown address 0" ucx error looks to be related to the rdma network 
devices plugged into my test machine.
   I spawned a clean VM for testing, and the error does not occur there.
   
   I set a breakpoint where the error is printed: 
https://github.com/openucx/ucx/blob/v1.12.0/src/ucs/sys/sock.c#L660
   Interestingly, when the breakpoint fires, printing `addr->sa_family` gives 
2 (AF_INET), which should be impossible given that the error reports 0.
   It looks like `addr` points to volatile memory that is changed by other 
threads or hardware in parallel.
   
   `addr` is obtained by calling `rdma_get_local_addr` at 
https://github.com/openucx/ucx/blob/v1.12.0/src/uct/ib/rdmacm/rdmacm_cm_ep.c#L176
   
   From the man page: https://linux.die.net/man/3/rdma_get_local_addr
   `rdma_get_local_addr` returns an all-zero address if the rdma nic is not 
bound to an address. I do have some rdma nics disabled, so the error looks 
harmless, though that doesn't explain the strange behaviour seen in the 
debugger.
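
   To illustrate why an all-zero address would produce a family of 0, here is 
a minimal sketch. It is not the UCX code: `sockaddr_size_by_family` and 
`sockaddr_is_all_zero` are hypothetical helpers, and the switch only assumes 
that the real check dispatches on `sa_family` in a similar way.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper (not UCX code): return the size of the concrete
 * sockaddr type for a known family, or -1 for an unrecognized family. */
static int sockaddr_size_by_family(const struct sockaddr *addr)
{
    assert(addr != NULL);
    switch (addr->sa_family) {
    case AF_INET:
        return (int)sizeof(struct sockaddr_in);
    case AF_INET6:
        return (int)sizeof(struct sockaddr_in6);
    default:
        /* An all-zero sockaddr lands here: sa_family is 0 (AF_UNSPEC). */
        return -1;
    }
}

/* Hypothetical helper: return 1 if every byte of the storage is zero,
 * which is what the man page describes for an unbound rdma_cm_id. */
static int sockaddr_is_all_zero(const struct sockaddr_storage *ss)
{
    static const struct sockaddr_storage zero; /* zero-initialized */
    return memcmp(ss, &zero, sizeof(zero)) == 0;
}
```

   A caller could test the address with `sockaddr_is_all_zero` before 
dispatching on the family and treat the unbound case as expected instead of 
logging an error. Whether that is appropriate for UCX is a separate question.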

