Hi guys,

I have a problem related to the subject of this mail; the details are below.
Can anybody tell me whether this behavior is a restriction of Open MPI
or something else?

I ran an MPI program and got the following error related to
ibv_create_ah.

[sho@host0 ~]$ /opt/openmpi1103_debug/bin/mpirun -host host0,host1 -npernode 1 -np 2 ./sample
PROC(0): senddata = 10
libibverbs: ibv_create_ah failed to query port.
[host1:4395] *** An error occurred in MPI_Send
[host1:4395] *** reported by process [139776618004481,0]
[host1:4395] *** on communicator MPI_COMM_WORLD
[host1:4395] *** MPI_ERR_OTHER: known error not in list
[host1:4395] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host1:4395] ***    and potentially your MPI job)

host0 has a ConnectX-3 HCA with 2 ports; the cable is connected to port 2.
host1 has a ConnectX-4 HCA with 1 port; the cable is connected to port 1.

The function udcm_endpoint_init_data seems to pass the remote port number to
ibv_create_ah.
I added a printf to print remote_msg->mm_port_num and found that it printed
1 on host0 and 2 on host1. Since host1 has no port 2, that would explain why
ibv_create_ah fails to query the port there.
Is this correct? I think the local port number should be passed to
ibv_create_ah.

static int udcm_endpoint_init_data (mca_btl_base_endpoint_t *lcl_ep)
           :                                  :
        ah_attr.dlid          = lcl_ep->rem_info.rem_lid;
        ah_attr.port_num      = remote_msg->mm_port_num;  /* <****** this is the remote port */
        ah_attr.sl            = mca_btl_openib_component.ib_service_level;
        ah_attr.src_path_bits = lcl_ep->endpoint_btl->src_path_bits;

        udep->ah = ibv_create_ah (lcl_ep->endpoint_btl->device->ib_pd, &ah_attr);

I modified the above code to specify the local port directly. With that
change, the sample program ran correctly on host0 and host1.
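
The idea behind the change can be sketched roughly as follows. This is only
an illustration, not the actual Open MPI code: struct sketch_ah_attr is a
hypothetical stand-in for the few fields of struct ibv_ah_attr used here (so
the snippet compiles without libibverbs), and sketch_fill_ah_attr is a
made-up helper name. It relies on the assumption stated above, that port_num
names the physical port on the local HCA while dlid identifies the remote
side.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct ibv_ah_attr used below.
 * Real code would use struct ibv_ah_attr from <infiniband/verbs.h>. */
struct sketch_ah_attr {
    uint16_t dlid;      /* LID of the remote port (the destination) */
    uint8_t  sl;        /* service level */
    uint8_t  port_num;  /* physical port on the LOCAL HCA */
};

/* Hypothetical helper: fill the address-handle attributes for one peer. */
static void sketch_fill_ah_attr(struct sketch_ah_attr *attr,
                                uint16_t remote_lid,
                                uint8_t local_port,
                                uint8_t service_level)
{
    /* InfiniBand port numbers are 1-based, so 0 is never a valid port. */
    assert(local_port >= 1);

    attr->dlid     = remote_lid;     /* where the packets are sent */
    attr->sl       = service_level;
    attr->port_num = local_port;     /* which local port sends them */
}
```

In the setup described above, host1 would pass 1 as local_port no matter
which port number (2) host0 advertises for itself.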

With best regards,
Takashi Sato
