Here is a little document I wrote trying to summarize all the things that we might want to add to the verbs API to support device capabilities that aren't exposed yet. There are a number of issues to resolve, and answers to the questions I ask below would help us make progress towards actually supporting all this.

There are a number of verbs that are common to the iWARP/RDMA consortium verbs and the InfiniBand base memory management extensions (IB-BMME). We would probably add one device capability bit for "BMME" (and all iWARP devices could set it) to show support for everything here:

- Allocate L_Key/STag. This allocates MR resources without actually registering memory; the MR can then be registered or invalidated as described below.
- "Fast register" memory through send queue. This allows a work request to be posted to a send queue to register memory using an L_Key/STag that is in the invalid state.
- Local invalidate send work requests, which can be used to invalidate an MR or MW. One subtle point here is that local invalidate operations have very loose ordering, in the sense that they can be executed before earlier requests, but support for fencing local invalidate operations is mandatory in iWARP and only optional in IB. But is there any IB device that currently exists that supports BMME but doesn't support local invalidate fencing? I really hope we can ignore this possibility.
- Memory windows associated to a single QP and bound using send work requests posted with the normal post send verb rather than a separate MW verb. (See below for more)

In addition there are things that are optional in both specs:

- Block-list physical buffer lists; this allows memory regions to be registered with arbitrary size/alignment blocks instead of just page-aligned chunks. Yet another capability bit if we want to expose this.

There are a few discrepancies between the iWARP and IB verbs that we need to decide on how we want to handle:

- In IB-BMME, L_Keys and R_Keys are split up so that there is an 8-bit "key" that is owned by the consumer. As far as I know, there is no analogous concept defined for iWARP STags; is there any point in supporting this IB-only feature (which is optional even in the IB spec)?
- Along similar lines, IB defines two types of memory windows, "type 1" and "type 2", and in fact type 2 is split into "2A" and "2B" (the difference is basically whether the MW is associated with just a QP, or with a QP and a PD). iWARP memory windows are always what the IB spec would call type 2B. All the IB devices that I know of with IB-BMME support can handle type 2B memory windows. Is there any point in having our API worry about the distinction between 2A and 2B, or should we just decree that we only handle type 2B? (Does anyone who hasn't just been reading specs even understand the distinction between type 2A and 2B?)
- Further, the MW API that we have now, with a separate bind MW verb, corresponds to type 1 MWs. Type 2 MWs are bound by posting a work request using the standard "post send" verb. Given that no IB device drivers have implemented the bind MW verb yet, does it make sense to deprecate the API for type 1 MWs and say that everyone should use type 2[B] MWs only?
- iWARP supports "RDMA read with invalidate" send work requests, while IB has no such operation. This makes sense because iWARP requires the buffer used to receive RDMA read responses to have remote write permission, while IB has no such requirement. I don't see a really clean way to handle this except to say that apps have to have "if (IB) do_this(); else /* iWARP */ do_that();" code to use this in a portable way.
- Zero-based virtual addresses for memory regions. This is mandatory for iWARP and optional for IB (and is not required even for BMME). I think the simplest thing to do is just to have yet another capability bit to say whether a device supports ZBVA or not; all iWARP devices can set it.

Finally, there are proprietary verbs extensions that are only supported by a single device at the moment, which we have to decide if and how to support.
It is a tradeoff between making useful features available versus making the already overly complex verbs API even more impossible to fathom, although it seems all of these have users asking for them:

- ConnectX has XRC, masked atomic operations, and the "block loopback" flag for UD QPs at least.
- eHCA has "low-latency" QPs.
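
To make the capability-bit idea and the "if (IB) do_this(); else do_that();" point a bit more concrete, here is a rough sketch in C. Every name in it (the SKETCH_* constants, struct sketch_device, the two helpers and the two sample devices) is made up for illustration; none of it is existing verbs API, and the bit assignments are arbitrary. It only assumes what is said above: one bit for BMME, one for block-list physical buffer lists, one for ZBVA, and a transport test to choose between a plain RDMA read and an RDMA read with invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability bits; names and values are illustrative only. */
enum sketch_device_cap {
	SKETCH_CAP_BMME       = 1u << 0, /* alloc L_Key/STag, fast register,
					    local invalidate, QP-bound MWs */
	SKETCH_CAP_BLOCK_LIST = 1u << 1, /* block-list physical buffer lists */
	SKETCH_CAP_ZBVA       = 1u << 2, /* zero-based virtual addresses */
};

enum sketch_transport { SKETCH_TRANSPORT_IB, SKETCH_TRANSPORT_IWARP };

enum sketch_read_op { SKETCH_OP_RDMA_READ, SKETCH_OP_RDMA_READ_WITH_INV };

/* Stand-in for whatever per-device attributes the real API would report. */
struct sketch_device {
	enum sketch_transport transport;
	unsigned int caps;
};

/* True only if every bit in 'cap' is set for this device. */
static inline bool sketch_has_cap(const struct sketch_device *dev,
				  unsigned int cap)
{
	assert(dev != NULL);
	return (dev->caps & cap) == cap;
}

/*
 * The portability branch: iWARP read sink buffers need remote write
 * permission, so an iWARP consumer would use read with invalidate;
 * IB has no such operation and uses a plain RDMA read.
 */
static inline enum sketch_read_op
sketch_pick_read_op(const struct sketch_device *dev)
{
	assert(dev != NULL);
	if (dev->transport == SKETCH_TRANSPORT_IWARP)
		return SKETCH_OP_RDMA_READ_WITH_INV;
	return SKETCH_OP_RDMA_READ;
}

/* Sample devices: an iWARP device sets BMME and ZBVA; the IB one only BMME. */
static const struct sketch_device sketch_iwarp_dev = {
	.transport = SKETCH_TRANSPORT_IWARP,
	.caps = SKETCH_CAP_BMME | SKETCH_CAP_ZBVA,
};

static const struct sketch_device sketch_ib_dev = {
	.transport = SKETCH_TRANSPORT_IB,
	.caps = SKETCH_CAP_BMME,
};
```

A consumer would call sketch_has_cap() once at setup time to decide whether it can use fast registration or ZBVA at all, and sketch_pick_read_op() when building each read work request. Whether the transport test belongs in every app or behind a helper like this is exactly the kind of question that still needs an answer.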

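On the question of the 8-bit consumer-owned "key" in IB-BMME L_Keys and R_Keys, this is roughly what exposing it would mean for a consumer. The sketch assumes the key occupies the low-order 8 bits of a 32-bit L_Key/R_Key, with the remaining 24 bits owned by the device; that layout is my assumption here and should be checked against the IB spec. The sketch_* helper names are hypothetical and not part of any existing API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the consumer-owned key is the low 8 bits of the L_Key/R_Key. */
#define SKETCH_KEY_MASK 0xffu

/* Extract the consumer-owned 8-bit key. */
static inline uint8_t sketch_consumer_key(uint32_t lkey)
{
	return (uint8_t)(lkey & SKETCH_KEY_MASK);
}

/* Replace the consumer-owned key, leaving the device-owned bits alone. */
static inline uint32_t sketch_set_consumer_key(uint32_t lkey, uint8_t new_key)
{
	return (lkey & ~(uint32_t)SKETCH_KEY_MASK) | new_key;
}

/*
 * Advance the consumer key by one (wrapping at 256), as a consumer might
 * do before each fast register so that stale keys stop matching.
 */
static inline uint32_t sketch_bump_key(uint32_t lkey)
{
	uint8_t next = (uint8_t)(sketch_consumer_key(lkey) + 1);
	uint32_t out = sketch_set_consumer_key(lkey, next);

	/* The device-owned upper 24 bits must be unchanged. */
	assert((out >> 8) == (lkey >> 8));
	return out;
}
```

Since iWARP STags have no analogous concept that I know of, helpers like these would only make sense on IB devices that implement this optional feature, which is part of why it is not obvious the feature is worth exposing.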