Hi guys, I have a problem related to the subject line. The details are below. Can anybody tell me whether this behavior is a restriction of Open MPI or something else?
I executed an MPI program and got the following error related to ibv_create_ah.

[sho@host0 ~]$ /opt/openmpi1103_debug/bin/mpirun -host host0,host1 -npernode 1 -np 2 ./sample
PROC(0): senddata = 10
libibverbs: ibv_create_ah failed to query port.
[host1:4395] *** An error occurred in MPI_Send
[host1:4395] *** reported by process [139776618004481,0]
[host1:4395] *** on communicator MPI_COMM_WORLD
[host1:4395] *** MPI_ERR_OTHER: known error not in list
[host1:4395] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host1:4395] ***    and potentially your MPI job)

host0 has a ConnectX-3 HCA with 2 ports, and a cable is connected to port 2.
host1 has a ConnectX-4 HCA with 1 port, and a cable is connected to port 1.

The function udcm_endpoint_init_data seems to pass a remote port number to ibv_create_ah. I added a printf to output remote_msg->mm_port_num and found that it printed 1 on host0 and 2 on host1. Since host1 has only one port, port 2 does not exist there, which would explain the "failed to query port" message. Is this correct? I think a local port number should be passed to ibv_create_ah.

static int udcm_endpoint_init_data (mca_btl_base_endpoint_t *lcl_ep)
    :
    :
    ah_attr.dlid          = lcl_ep->rem_info.rem_lid;
    ah_attr.port_num      = remote_msg->mm_port_num;   <****** It's a remote port.
    ah_attr.sl            = mca_btl_openib_component.ib_service_level;
    ah_attr.src_path_bits = lcl_ep->endpoint_btl->src_path_bits;
    udep->ah = ibv_create_ah (lcl_ep->endpoint_btl->device->ib_pd, &ah_attr);

I modified the above code to specify a local port directly. After that, the sample code ran correctly on host0 and host1.

With best regards,
Takashi Sato
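
For illustration, the kind of change described above could look roughly like the following. This is only a sketch, not the actual patch: it does not use libibverbs or Open MPI headers, and ah_attr_sketch, conn_info_sketch and build_ah_attr are hypothetical stand-ins for struct ibv_ah_attr, the endpoint data and the relevant part of udcm_endpoint_init_data. The only point it makes is that the destination LID comes from the peer, while the port number is taken from the local HCA.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the struct ibv_ah_attr fields used in the post. */
struct ah_attr_sketch {
    uint16_t dlid;           /* destination LID, reported by the peer */
    uint8_t  sl;             /* service level */
    uint8_t  src_path_bits;  /* source path bits of the local port */
    uint8_t  port_num;       /* physical port on the LOCAL HCA */
};

/* Hypothetical summary of what one side knows about the connection. */
struct conn_info_sketch {
    uint16_t remote_lid;       /* like lcl_ep->rem_info.rem_lid */
    uint8_t  remote_port_num;  /* like remote_msg->mm_port_num */
    uint8_t  local_port_num;   /* port of the local HCA that is cabled */
    uint8_t  service_level;
    uint8_t  src_path_bits;
};

/* Fill the address-handle attributes using the local port number. */
static struct ah_attr_sketch build_ah_attr(const struct conn_info_sketch *c)
{
    struct ah_attr_sketch attr;
    memset(&attr, 0, sizeof attr);

    /* InfiniBand port numbers start at 1, so 0 is never valid. */
    assert(c->local_port_num >= 1);

    attr.dlid          = c->remote_lid;      /* where the packets go */
    attr.port_num      = c->local_port_num;  /* where they leave from */
    attr.sl            = c->service_level;
    attr.src_path_bits = c->src_path_bits;
    /* c->remote_port_num is deliberately not used for port_num. */
    return attr;
}
```

With the setup from the post, host1 would have local_port_num = 1 and remote_port_num = 2, so the resulting port_num is 1, a port that exists on its single-port HCA.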