Re: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general] OpenMPI and RDMA-CM)

Donald Kerr Tue, 8 May 2007 16:21:27 -0400


Steve Wise wrote:

On Tue, 2007-05-08 at 13:57 -0400, Andrew Friedley wrote:

Steve Wise wrote:
Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work.  I'm
debugging now.
Here's part of the problem (from ompi/btl/udapl/btl_udapl.c):

   /* TODO - big bad evil hack! */
   /* uDAPL doesn't ever seem to keep track of ports with addresses.  This
      becomes a problem when we use dat_ep_query() to obtain a remote address
      on an endpoint.  In this case, both the DAT_PORT_QUAL and the sin_port
      field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is
      a problem when we have more than one uDAPL process per IA - these
      processes will have exactly the same address, as the port is all
      we have to differentiate who is who.  Thus, our uDAPL EP -> BTL EP
      matching algorithm will break down.

      So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for
      this IA.  uDAPL then conveniently propagates this to where we need it.
    */
   ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
   ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);
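
The matching problem that comment describes can be shown with a small sketch. It assumes IPv4 addresses held in struct sockaddr_in; same_peer() and make_addr() are hypothetical helpers written for illustration and are not taken from btl_udapl.c.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: treat two IPv4 socket addresses as the same peer
 * only when family, address and port all match. */
static int same_peer(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    return a->sin_family == b->sin_family &&
           a->sin_addr.s_addr == b->sin_addr.s_addr &&
           a->sin_port == b->sin_port;
}

/* Hypothetical helper: build a sockaddr_in from a dotted-quad string and
 * a port given in host byte order. */
static struct sockaddr_in make_addr(const char *ip, unsigned short port)
{
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    inet_pton(AF_INET, ip, &sa.sin_addr);
    sa.sin_port = htons(port);
    return sa;
}
```

If two processes share one IA and both report port 0, same_peer() cannot tell them apart; once each address carries its own PSP port, the comparison distinguishes them.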

The OMPI code stuffs the port chosen by udapl for a listening endpoint
into the ia address memory (which is owned by the udapl layer btw).
There's a slight problem with that:  The OFA udapl openib_cma code binds
cm_id's to this ia_address regularly.  When an hca is opened, a cm_id is
bound to this address to obtain the local hca port number and gid that
is being used.  In addition, a cm_id is bound to this address each time
an endpoint is created (either at ep_create time or ep_connect time).
So that ia_address field is used by the dapl cm to create local
cm_ids...  While the port was always zero, the rdma-cma would choose a
unique port for each cm_id bound to that address.
But once OMPI sets the port field to non-zero, the rdma_cma fails all
the subsequent rdma_bind_addr() calls, since the port is already in use.
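
The bind behavior described above can be reproduced in miniature with ordinary TCP sockets. This is only an analogy: it assumes the rdma_cm port space treats port 0 and in-use ports the way the kernel's TCP port space does. bind_loopback() is a hypothetical helper and is not part of uDAPL, librdmacm, or OMPI.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a new TCP socket to 127.0.0.1:port, where
 * port is in host byte order and 0 means "let the kernel choose".
 * On success, returns the fd and stores the port actually bound.
 * On failure, returns -1 with errno preserved from the failing call. */
static int bind_loopback(unsigned short port, unsigned short *bound_port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    *bound_port = ntohs(addr.sin_port);
    return fd;
}
```

Calling bind_loopback(0, ...) twice should give two different ports, while a further call that passes one of those ports explicitly is expected to fail with EADDRINUSE as long as the first socket stays open. That mirrors what happens once the ia_address carries a fixed, non-zero port.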

Perhaps this hack really is a workaround for a DAPL bug where somebody's
dapl wasn't tracking port numbers correctly?
Yep. My memory is dim, but I think that was OFED's DAPL, or it was in
the generic part of DAPL that all implementations seem to share.

As hinted by the comment (I wrote it by the way), I think the best
solution would be if dat_ep_query() returned the port number correctly.
Most of uDAPL seems to just pass around pointers to internal data
structures (which I'm not sure is the best idea in the world), so it
didn't seem like a trivial fix to me at the time. I remember
considering reporting this as a bug, but I didn't because the uDAPL
standard didn't seem to enforce any requirements on passing the port
number around with the address, so it technically wasn't wrong.

Was the OFED uDAPL code switched from something else to RDMA CM at some
point? I'm almost certain I was running fine on OFED's uDAPL at one
point (in fact, a lot of the uDAPL BTL development I did was using the
OFED stack).


Yes, the OFA uDAPL was changed from using the ib-cm to the rdma-cm a
while back.  Perhaps you ran on the ib-cm version?  And, the rdma-cma
started using port numbers and enforcing uniqueness even more recently I
think.

Perhaps Don Kerr has some insight on how the Sun uDAPL behaves?  Should
OMPI still need this hack?

From what I recall (and Andrew can probably set me straight if I get
this wrong), this hack was included because we were not able to pull the
remote port from dat_ep_query. If dat_ep_query supplies that data then
we could probably do away with the hack.

I have not heard back from the developer at Sun who implemented uDAPL
for Solaris. My thought is that it was also based on the older ib-cm, but
I will confirm. I submitted a bug against Solaris uDAPL to provide the
port via dat_ep_query a while back and it looks like it has been fixed; I
just have not tested this because we weren't using it.


-DON


Steve.
