On Tue, Oct 06, 2009 at 06:20:05PM -0700, Sean Hefty wrote:

> There are 3 interfaces of interest here.  The librdmacm API, the rdma_ucm user
> to kernel interface, and the rdma_cm interface.  These patches are looking to
> change the rdma_ucm interface.  I want to avoid changing the API or behavior
> of the librdmacm in a way that requires changes to existing applications in
> order to run on larger clusters.

So, I'm just talking about the user space API, the others can be
changed as necessary to align with it. 

This is open source, so choosing a technically better solution over an
endlessly backwards compatible solution is done all the time and is
normal, expected, etc. Cost of progress - that is the underlying
rationale of Documentation/stable_api_nonsense.txt, and it applies just
as well to niche little user space libraries like these :)

> >So the output from IBACM would specify an AF_GID address family and
> >include opaque data blobs that are passed through the RDMA CM API that
> >contain all the PR records, service ID, etc. If used on non-IB then
> >IBACM could just return AF_IP/AF_IPV6 and related blobs. Thus the
> >consumer of the API gets transparency and network protocol agility,
> >and all the mess can be hidden in the address resolution API.
> 
> This is just debating where the transport abstraction occurs, but IMO the IB
> ACM should be IB centric.  Transport abstraction should occur somewhere
> above it.

Actually, I'm arguing that RDMA CM should have been a transport
mux/switch like socket() rather than just an IP addressing abstraction.

If it is a mux then an app coded to RDMA CM could speak native IB GID
addressing on the same API, and the scaling problems related to ARP and
the implicit kernel PR query of the IP abstraction can be neatly
eliminated by eliminating the abstraction. It is the abstraction to
IP addresses that is the root inefficiency - we need a transport
protocol agnostic API for CM that lets the native transport addressing
be used - not an abstraction layer (abstractions are always the bane
of efficiency).

If you want to have a naming layer in user space that converts
*whatever* to GIDs then fine, great, but let's call it that and not
co-mingle it with the IP address abstraction layer.

Adding IB CM semantics to the RDMA CM API does not seem to be too
hard:
 rdma_create_id uses a new RDMA_PS_IB to signal IB CM behavior

 rdma_resolve_addr and all other functions use a sockaddr that is an
   IB GID, pkey, service ID etc. rdma_resolve_addr at least grows a
   new parameter which is a 'hw address'. This is mandatory for
   RDMA_PS_IB and is up to 5 PR records. (1 for CM path,
   forward/reverse for primary, and forward/reverse for alternative)

 rdma_get_addr_info returns a struct with the rdma_port_space, sockaddr
   src, sockaddr dest and 'hw address' values that the app blindly plugs
   into the calls. libacm-type functionality is implemented entirely in
   rdma_get_addr_info.
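
To make the shape of that proposal concrete, here is a rough sketch of
the data that would flow through such an API. None of this is existing
librdmacm code: every hyp_* name is made up for illustration, the
64-byte opaque path record size is an assumption, and the only rules
encoded are the ones stated above (a 'hw address' is mandatory for the
IB port space and holds at most 5 path records).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: CM path + fwd/rev primary + fwd/rev alternative. */
#define HYP_MAX_PATH_RECS 5

/* Hypothetical port spaces; HYP_PS_IB stands in for the proposed RDMA_PS_IB. */
enum hyp_port_space { HYP_PS_TCP, HYP_PS_UDP, HYP_PS_IB };

/* Hypothetical "IB sockaddr": native addressing, no IP involved. */
struct hyp_ib_addr {
    uint8_t  gid[16];
    uint16_t pkey;
    uint64_t service_id;
};

/* Opaque path record blob; the size is an assumption. */
struct hyp_path_rec { uint8_t raw[64]; };

/* The 'hw address': the path records userspace resolved itself. */
struct hyp_hw_addr {
    int n_recs;
    struct hyp_path_rec recs[HYP_MAX_PATH_RECS];
};

/* What an rdma_get_addr_info-style call would hand back to the app. */
struct hyp_addr_info {
    enum hyp_port_space ps;
    struct hyp_ib_addr  src, dst;
    struct hyp_hw_addr  hw;
};

/* Append one path record; returns -1 once all 5 slots are used. */
static int hyp_hw_addr_add(struct hyp_hw_addr *hw, const struct hyp_path_rec *pr)
{
    assert(hw != NULL && pr != NULL);
    if (hw->n_recs >= HYP_MAX_PATH_RECS)
        return -1;
    hw->recs[hw->n_recs++] = *pr;
    return 0;
}

/*
 * Argument check a resolve call could perform: for the IB port space the
 * hw address is mandatory and must carry 1..5 records; for the IP port
 * spaces the kernel resolves the hardware address, so none is required.
 */
static int hyp_check_resolve_args(enum hyp_port_space ps,
                                  const struct hyp_hw_addr *hw)
{
    if (ps != HYP_PS_IB)
        return 0;
    if (hw == NULL || hw->n_recs < 1 || hw->n_recs > HYP_MAX_PATH_RECS)
        return -1;
    return 0;
}
```

The app would never look inside hyp_addr_info; it would pass ps to the
create call and src, dst and hw to the resolve call as-is.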

The trade off is that apps that want to scale use IB GID addressing,
IB GID CM, and IB service ID at the rdma_cm layer, and libacm provides
a name mapping from hostname or IP to GID, if it's being used. No IP
addresses, no IP listen matching, no port space TCP issues. Just
straight IB service IDs.

The symmetry is pretty simple: the kernel always takes care of
IP hardware addressing, userspace always takes care of IB
hardware addressing. That is the nature of the two protocols.

> This still leaves open the issue of how to communicate that data to the kernel
> so that the rdma_cm can format the IB CM REQ correctly and send it on its
> merry little way.

The rdma_ucm interface would have to be extended to be able to do 100%
of the functionality of the IB CM interface using the rdma_cm_id
abstraction. This is very useful in and of itself and much better than
adding an obscure option to override the ARP query. For instance,
other MPIs could immediately provide their users an option to use GID
addresses directly and cut out the ARP overhead instantly with
little code change.

> >Another topic, but yes, ip route get just does a netlink
> >query. I can give you all the details if you want to try it.
> 
> Yes, please - see below

I'll look through my code. Remind me if I forget; I can't do it just
now.
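
In the meantime, a rough sketch of the kind of request involved,
written from general knowledge of rtnetlink and not taken from the code
mentioned above. It assumes Linux and IPv4 and only builds the
RTM_GETROUTE message; the names route_req and build_route_get are made
up for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical buffer layout: header, route message, room for attributes. */
struct route_req {
    struct nlmsghdr nh;
    struct rtmsg    rt;
    char            attrs[64];
};

/*
 * Fill in a "which route would be used for dst_ip" query.
 * Returns 0 on success, -1 if dst_ip is not a valid IPv4 address.
 */
static int build_route_get(struct route_req *req, const char *dst_ip)
{
    struct in_addr dst;
    struct rtattr *rta;

    if (inet_pton(AF_INET, dst_ip, &dst) != 1)
        return -1;

    /* The single attribute we add must fit in the attribute area. */
    assert(sizeof(req->attrs) >= RTA_LENGTH(sizeof(dst)));

    memset(req, 0, sizeof(*req));
    req->nh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtmsg));
    req->nh.nlmsg_type  = RTM_GETROUTE;
    req->nh.nlmsg_flags = NLM_F_REQUEST;
    req->rt.rtm_family  = AF_INET;
    req->rt.rtm_dst_len = 32;          /* full host address */

    /* Append the destination as an RTA_DST attribute after the rtmsg. */
    rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nh.nlmsg_len));
    rta->rta_type = RTA_DST;
    rta->rta_len  = RTA_LENGTH(sizeof(dst));
    memcpy(RTA_DATA(rta), &dst, sizeof(dst));
    req->nh.nlmsg_len = NLMSG_ALIGN(req->nh.nlmsg_len) + rta->rta_len;

    return 0;
}
```

The message would then be written to a socket opened with
socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE), and the reply parsed for
attributes such as RTA_OIF and RTA_GATEWAY.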

Jason