Re: [OMPI devel] OMPI over ofed udapl over iwarp

Donald Kerr Sun, 13 May 2007 21:26:26 -0400


Caitlin Bestler wrote:

Donald Kerr wrote:

order of business after connection establishment
(mba_btl_udapl_sendrecv()).  The RECV buffer post for this exchange,
however, should really be done _before_ the
dat_ep_connect() on the active side, and _before_ the
dat_cr_accept() on the server side.
Currently it's done after the ESTABLISHED event is dequeued, thus
allowing the race condition.

I believe the rules are the ULP must ensure that a RECV is posted
before the client can post a SEND for that buffer.
And further, the ULP must enforce flow control somehow so that a
SEND never arrives without a RECV buffer being available.

maybe this is a rule iwarp imposes on its ULPs but not uDAPL.


It is most assuredly a rule for uDAPL. And it is not a matter
of iWARP "imposing" on uDAPL. uDAPL was explicitly designed
to support IB, iWARP and VI. To do that DAPL documents its
model of what RDMA is.

(sorry I was off the grid for a couple of days)

Not to beat a dead horse, but you would have to show me where in the Spec it says I must post a recv before a send. And thinking about it some, I don't believe there is a race condition because this is not called out as such. Now if posting the handshake recv before the connect call speeds things up and helps the iwarp scenario, I am all for it.

This issue is in fact one that is truly fundamental to the
efficiency of RDMA -- the transport layer DOES NOT provide
buffering. That's the application's job. It is precisely
because the application layer does a better job that RDMA
can achieve better performance at high bandwidth.

For reasons that have been discussed in more depth in the
RDMA applicability statement and in RDDP/IPS discussions
on iSER, the absence of transport layer buffer throttling
places the onus for end-to-end pacing on the application.
It is a situation somewhat akin to a car with a broken
speedometer that had previously only driven during rush
hour bumper-to-bumper traffic. The fact that the speedometer
was broken was irrelevant. But if that same car hits the
open road the driver will need to come up with some method
of regulating their speed.

The DAPL semantics are very clear that send/recv operations must
be matched one to one, that the receive buffer must be large
enough for the received message and that there must be a receive
buffer for each incoming send/recv message. That means that
the sender needs to have some basis for believing that the
RECV has been posted. Usually this is an explicit credit
that is decremented per message and incremented per response.
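The explicit credit described above can be sketched in C as a small sender-side counter. This is only an illustration of the idea: `credit_gate` and its functions are hypothetical names, not part of uDAPL or the Open MPI BTL, and the sketch assumes each response reports how many RECV buffers the peer has reposted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sender-side credit counter (not a uDAPL or Open MPI type). */
typedef struct {
    int credits;   /* RECV buffers the peer is believed to have posted */
} credit_gate;

/* Start with the number of RECVs the peer posted before the connection. */
static void credit_gate_init(credit_gate *g, int initial_credits)
{
    assert(initial_credits >= 0);
    g->credits = initial_credits;
}

/* Consume one credit if available; the caller posts a SEND only on true. */
static bool credit_gate_try_send(credit_gate *g)
{
    if (g->credits == 0) {
        return false;   /* no RECV known to be posted: hold the message */
    }
    g->credits--;
    return true;
}

/* A response arrived reporting that the peer reposted `returned` RECVs. */
static void credit_gate_on_response(credit_gate *g, int returned)
{
    assert(returned >= 0);
    g->credits += returned;
}
```

A sender using this gate would queue a message locally whenever `credit_gate_try_send()` returns false and retry after the next response.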

Matching one to one, sure; that still does not say a recv must be posted before a send. Flow control is handled by the BTL.

What DAPL does not state is if the transport does explicit flow
control so that the sending application's work request is simply
not processed (and the sending application continues to provide
the buffer, as with InfiniBand) or whether the sender simply
transmits and leaves error detection to the receiver (iWARP).
There are theoretical advantages to both, but more importantly
neither of them is going to change. So the Consumer of RDMA
applications needs to use ULP/application layer flow control
to pace the transmitter. At the application layer that means
that the RECV must be posted *before* the Send/accept that
grants ULP credits to the far side.
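The ordering argument can be illustrated with a toy model in C. It is a sketch under stated assumptions, not uDAPL code: `toy_endpoint` and its helpers are hypothetical, the model assumes the iWARP-style behaviour described above (a SEND that finds no posted RECV is an error at the receiver), and it deliberately picks the worst-case interleaving in which the peer's SEND arrives as soon as the connection is established.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one receiving endpoint; all names are hypothetical. */
typedef struct {
    int  posted_recvs;  /* RECV buffers currently posted */
    bool failed;        /* set when a SEND arrives with no RECV posted */
} toy_endpoint;

static void toy_init(toy_endpoint *ep)
{
    ep->posted_recvs = 0;
    ep->failed = false;
}

static void toy_post_recv(toy_endpoint *ep)
{
    ep->posted_recvs++;
}

/* An incoming SEND consumes a posted RECV, or marks the endpoint failed. */
static void toy_deliver_send(toy_endpoint *ep)
{
    if (ep->posted_recvs > 0) {
        ep->posted_recvs--;
    } else {
        ep->failed = true;
    }
}

/* Run one handshake; recv_first selects when the local RECV is posted. */
static bool toy_handshake_ok(bool recv_first)
{
    toy_endpoint ep;
    toy_init(&ep);

    if (recv_first) {
        toy_post_recv(&ep);   /* posted before connect/accept */
    }

    /* Connection established: assume the peer's SEND arrives immediately. */
    toy_deliver_send(&ep);

    if (!recv_first) {
        toy_post_recv(&ep);   /* posted after ESTABLISHED: too late here */
    }
    return !ep.failed;
}
```

In this model `toy_handshake_ok(true)` succeeds and `toy_handshake_ok(false)` fails. Whether a real transport exposes that interleaving is exactly the point under discussion in this thread.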

All of that should be clear in the IOV ownership rules and
discussion of the semantics of send/recv. If you thought you
saw something that implied any guarantees to the contrary
then could you point them out in a posting to the DAT reflector?
(or just send them to me or Arkady Kanevsky).

I believe it was either you or Steve who claimed a recv must be posted before a send, thus leading to a race condition. I fail to see this. But again, if Steve's patch makes things better I am all for it.


-DON



