Hey Roland. Nice write-up. Comments inline below:
Roland Dreier wrote:
Here is a little document I wrote trying to summarize all the things
that we might want to add to the verbs API to support device
capabilities that aren't exposed yet. There are a number of issues to
resolve, and answers to the questions I ask below would help us make
progress towards actually supporting all this.
There are a number of verbs that are common to the iWARP/RDMA
consortium verbs and the InfiniBand base memory management extensions
(IB-BMME). We would probably add one device capability bit for "BMME"
(and all iWARP devices could set it) to show support for everything here:
- Allocate L_Key/STag. This allocates MR resources without actually
registering memory; the MR can then be registered or invalidated as
described below.
- "Fast register" memory through send queue. This allows a work
request to be posted to a send queue to register memory using an
L_Key/STag that is in the invalid state.
- Local invalidate send work requests, which can be used to
invalidate an MR or MW. One subtle point here is that local
invalidate operations have very loose ordering, in the sense that
they can be executed before earlier requests, but support for
fencing local invalidate operations is mandatory in iWARP and only
optional in IB. But is there any IB device that currently exists
that supports BMME but doesn't support local invalidate fencing?
I really hope we can ignore this possibility.
- Memory windows associated to a single QP and bound using send work
requests posted with the normal post send verb rather than a
separate MW verb. (See below for more)
In addition there are things that are optional in both specs:
- Block-list physical buffer lists; this allows memory regions to be
registered with arbitrary size/alignment blocks instead of just
page-aligned chunks. Yet another capability bit if we want to
expose this.
There are a few discrepancies between the iWARP and IB verbs that we
need to decide on how we want to handle:
- In IB-BMME, L_Keys and R_Keys are split up so that there is an
8-bit "key" that is owned by the consumer. As far as I know, there
is no analogous concept defined for iWARP STags; is there any point
in supporting this IB-only feature (which is optional even in the
IB spec)?
In fact there is an 8-bit key for STags as well. The STag is composed of
a 3-byte index allocated by the driver/hardware and a 1-byte key
specified by the consumer. None of this is exposed in the Linux RDMA
interface at this point, and cxgb3 always sets the key to 0xff.
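A minimal sketch of that split, in case it helps the discussion. The
helper names are hypothetical and not part of any existing kernel or
driver API, and the layout (index in the upper 3 bytes, consumer key in
the low byte) is an assumption made for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not an existing kernel API. */

/* Build a 32-bit STag from a 3-byte index and a 1-byte consumer key.
 * Assumed layout: index in bits 31..8, key in bits 7..0. */
static inline uint32_t stag_make(uint32_t index, uint8_t key)
{
    assert(index <= 0xFFFFFFu);          /* index must fit in 3 bytes */
    return (index << 8) | key;
}

/* Extract the driver/hardware-owned index. */
static inline uint32_t stag_index(uint32_t stag)
{
    return stag >> 8;
}

/* Extract the consumer-owned key. */
static inline uint8_t stag_key(uint32_t stag)
{
    return (uint8_t)(stag & 0xFFu);
}

/* Replace only the consumer key, leaving the index untouched. */
static inline uint32_t stag_set_key(uint32_t stag, uint8_t key)
{
    return (stag & ~0xFFu) | key;
}
```

With this layout, a driver that always uses key 0xff (as cxgb3 does
today) would hand out STags whose low byte is 0xff, and a consumer-owned
key would amount to letting the caller choose that byte instead.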
- Along similar lines, IB defines two types of memory windows, "type
1" and "type 2" and in fact type 2 is split into "2A" and "2B" (the
difference is basically whether the MW is associated with just a
QP, or with a QP and a PD). iWARP memory windows are always what
the IB spec would call type 2B. All the IB devices that I know of
with IB-BMME support can handle type 2B memory windows. Is there
any point in having our API worry about the distinction between 2A
or 2B, or should we just decree that we only handle type 2B? (Does
anyone who hasn't just been reading specs even understand the
distinction between type 2A and 2B?)
- Further, the MW API that we have now, with a separate bind MW verb,
corresponds to type 1 MWs. Type 2 MWs are bound by posting a work
request using the standard "post send" verb. Given that no IB
device drivers have implemented the bind MW verb yet, does it make
sense to deprecate the API for type 1 MWs and say that everyone
should use type 2[B] MWs only?
The Chelsio driver supports the iWARP bind MW send queue (SQ) work
request via the current API. In fact, the current API documentation
already implies that this call is an SQ operation:
/**
* ib_bind_mw - Posts a work request to the send queue of the specified
* QP, which binds the memory window to the given address range and
* remote access attributes.
How is the current bind_mw API not valid or correct for iWARP MWs, other
than being a different call from ib_post_send()?
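A minimal sketch of this point, using made-up types rather than the real
verbs structures: a dedicated bind call can be a thin wrapper that
builds a bind work request and hands it to the same send-queue path as
any other WR. Everything prefixed demo_ or DEMO_ is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a send queue; not the kernel verbs structures. */
enum demo_opcode { DEMO_SEND, DEMO_BIND_MW };

struct demo_wr {
    enum demo_opcode opcode;
    uint32_t mw_rkey;   /* memory window being bound */
    uint64_t addr;      /* start of the bound address range */
    uint32_t length;    /* length of the bound address range */
};

#define DEMO_SQ_DEPTH 4

struct demo_qp {
    struct demo_wr sq[DEMO_SQ_DEPTH];
    size_t sq_count;
};

/* Generic path: queue one WR. Returns 0, or -1 if the SQ is full. */
static int demo_post_send(struct demo_qp *qp, const struct demo_wr *wr)
{
    assert(qp != NULL && wr != NULL);
    if (qp->sq_count == DEMO_SQ_DEPTH)
        return -1;
    qp->sq[qp->sq_count++] = *wr;
    return 0;
}

/* Dedicated bind call: builds a bind WR and reuses the SQ path. */
static int demo_bind_mw(struct demo_qp *qp, uint32_t mw_rkey,
                        uint64_t addr, uint32_t length)
{
    struct demo_wr wr = {
        .opcode  = DEMO_BIND_MW,
        .mw_rkey = mw_rkey,
        .addr    = addr,
        .length  = length,
    };
    return demo_post_send(qp, &wr);
}
```

In this model the only difference between the two entry points is who
fills in the work request, which is the sense in which a separate bind
call and a post-send bind are equivalent.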
- iWARP supports "RDMA read with invalidate" send work requests,
while IB has no such operation. This makes sense because iWARP
requires the buffer used to receive RDMA read responses to have
remote write permission, while IB has no such requirement. I don't
see a really clean way to handle this except to say that apps have
to have "if (IB) do_this(); else /* iWARP */ do_that();" code to
use this in a portable way.
Alternatively, a transport-independent app can always use two WRs, an
RDMA read followed by a fenced local invalidate of the local STag,
instead of a single read-with-invalidate WR.
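A rough sketch of that two-WR chain, with hypothetical types and flag
names (these are not the kernel verbs structures). The fence on the
second WR matters because, as noted above, local invalidates are
otherwise allowed to execute before earlier requests:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the kernel verbs structures. */
enum sketch_opcode {
    SKETCH_RDMA_READ,
    SKETCH_LOCAL_INV,
};

/* Assumed meaning: do not start this WR until prior reads complete. */
#define SKETCH_FLAG_FENCE 0x1u

struct sketch_wr {
    struct sketch_wr *next;
    enum sketch_opcode opcode;
    unsigned int flags;
    uint32_t stag;      /* local STag/L_Key: read sink or key to invalidate */
};

/* Fill a two-element chain: an RDMA read into the buffer covered by
 * sink_stag, then a fenced local invalidate of that same STag.
 * Returns the head of the chain, ready to hand to a post-send call. */
static struct sketch_wr *build_read_then_inv(struct sketch_wr wr[2],
                                             uint32_t sink_stag)
{
    assert(wr != NULL);

    wr[0].next   = &wr[1];
    wr[0].opcode = SKETCH_RDMA_READ;
    wr[0].flags  = 0;
    wr[0].stag   = sink_stag;

    wr[1].next   = NULL;
    wr[1].opcode = SKETCH_LOCAL_INV;
    wr[1].flags  = SKETCH_FLAG_FENCE;   /* must not pass the read */
    wr[1].stag   = sink_stag;

    return &wr[0];
}
```

The same chain would work on either transport, at the cost of one extra
WR compared with a single read-with-invalidate on iWARP.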
- Zero-based virtual addresses for memory regions. This is mandatory
for iWARP and optional for IB (and is not required even for BMME).
I think the simplest thing to do is just to have yet another
capability bit to say whether a device supports ZBVA or not; all
iWARP devices can set it.
Currently, nobody is using ZBVA or the block-list feature. I don't
think we should bother supporting them unless someone has an app in mind
that will use them.
Finally, there are proprietary verbs extensions that are only
supported by a single device at the moment, which we have to decide if
and how to support. It is a tradeoff between making useful features
available versus making the already overly complex verbs API even more
impossible to fathom, although it seems all of these have users asking
for them:
- ConnectX has XRC, masked atomic operations, and the "block
loopback" flag for UD QPs at least.
- eHCA has "low-latency" QPs.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general