> > Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work. I'm
> > debugging now.
>
Here's part of the problem (from ompi/btl/udapl/btl_udapl.c):

    /* TODO - big bad evil hack! */
    /* uDAPL doesn't ever seem to keep track of ports with addresses. This
       becomes a problem when we use dat_ep_query() to obtain a remote
       address on an endpoint. In this case, both the DAT_PORT_QUAL and
       the sin_port field in the DAT_SOCK_ADDR are 0, regardless of the
       actual port. This is a problem when we have more than one uDAPL
       process per IA - these processes will have exactly the same
       address, as the port is all we have to differentiate who is who.
       Thus, our uDAPL EP -> BTL EP matching algorithm will break down.

       So, we insert the port we used for our PSP into the DAT_SOCK_ADDR
       for this IA. uDAPL then conveniently propagates this to where we
       need it.
     */
    ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
    ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);

The OMPI code stuffs the port chosen by udapl for a listening endpoint
into the ia address memory (which is owned by the udapl layer btw).
There's a slight problem with that: the OFA udapl openib_cma code binds
cm_ids to this ia_address regularly. When an hca is opened, a cm_id is
bound to this address to obtain the local hca port number and gid that
is being used. In addition, a cm_id is bound to this address each time
an endpoint is created (either at ep_create time or ep_connect time).

So that ia_address field is used by the dapl cm to create local
cm_ids. While the port was always zero, the rdma_cma would choose a
unique port for each cm_id bound to that address. But once OMPI sets the
port field to non-zero, the rdma_cma fails all the subsequent
rdma_bind_addr() calls, since the port is already in use.

Perhaps this hack really is a workaround for a DAPL bug where somebody's
dapl wasn't tracking port numbers correctly?

I think there are three issues here:

1) OMPI shouldn't be stepping on the ia_address.

2) OFA udapl should probably be explicitly binding local cm_ids to port
   zero.

3) dat_ep_query() should be returning the correct port numbers.

I'm going to run a few experiments:

1) Remove the OMPI hack and see if things work fine for OFA udapl.
   Perhaps OFA udapl correctly tracks ports on endpoints?

2) Leave OMPI as-is and change OFA udapl to not assume the ia_addr
   sockaddr has a 0 port in it.

Steve.
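P.S. For anyone who wants to see the bind behavior in isolation, here is a rough sketch using ordinary TCP sockets on loopback. It is only an analogy: it does not call rdma_bind_addr() or any uDAPL function, and bind_loopback() is a made-up helper name, not something from OMPI or OFA. My assumption is that the rdma_cma port space behaves like the regular socket one in this respect: a bind to port 0 gets a fresh port each time, while a second bind to an explicit port that is already taken fails with EADDRINUSE.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of OMPI or OFA udapl).
 * Binds a new TCP socket to 127.0.0.1:port, with port in host byte order.
 * On success returns the fd and stores the port the kernel actually
 * assigned in *bound_port. On failure returns -1 with errno preserved. */
int bind_loopback(unsigned short port, unsigned short *bound_port)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int fd;

    assert(bound_port != NULL);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = htons(port);   /* 0 means "kernel, pick a free port" */

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        int saved = errno;       /* close() may clobber errno */
        close(fd);
        errno = saved;
        return -1;
    }

    *bound_port = ntohs(sa.sin_port);
    return fd;
}
```

Calling bind_loopback(0, &p) twice while keeping both sockets open should hand back two different ports. Calling it again with one of those ports passed in explicitly should fail, which is the same shape of failure the cm_id binds hit once sin_port in the ia_address is non-zero.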