Re: [OMPI devel] OMPI over ofed udapl over iwarp

Donald Kerr Thu, 10 May 2007 23:11:08 -0400


Caitlin Bestler wrote:

[email protected] wrote:

There are two new issues so far:

1) this has uncovered a connection migration issue in the Chelsio
driver/firmware.  We are developing and testing a fix for this now.
Should be ready tomorrow hopefully.

I have a fix for the above issue and I can continue with OMPI testing.

To work around the client-must-send issue, I put a nice fat
sleep in the udapl btl right after it calls dat_cr_accept(),
in mca_btl_udapl_accept_connect().  This, however, exposes
another issue with the udapl btl:

Sleeping after accept? What are you trying to do here, force a race condition?

Neither the client nor the server side of the udapl btl
connection setup pre-post RECV buffers before connecting.
This can allow a SEND to arrive before a RECV buffer is
available.  I _think_ IB will handle this issue by
retransmitting the SEND.  Chelsio's iWARP device, however,
TERMINATEs the connection.  My sleep() makes this condition
happen every time.


A compliant DAPL program also ensures that there are adequate
receive buffers in place before the remote peer Sends. It is
explicitly noted that failure to follow this rule will invoke
a transport/device dependent penalty. It may be that the sendq
will be fenced, or it may be that the connection will be terminated.

So any RDMA BTL should pre-post recv buffers before initiating or
accepting a connection.
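To make the "transport/device dependent penalty" concrete, here is a toy C model of what a receiving device might do when a SEND arrives and no RECV is posted. It is not DAPL, verbs, or driver code, and every name in it is hypothetical. The three policies mirror the outcomes mentioned in this thread: an IB-style retry (reliable-connected InfiniBand QPs have RNR NAK/retry), a fenced queue, or the connection TERMINATE observed on the Chelsio iWARP device.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: what a device does when a SEND finds no posted RECV. */
typedef enum { POLICY_RETRY, POLICY_FENCE, POLICY_TERMINATE } empty_rq_policy;

typedef enum { EP_CONNECTED, EP_FENCED, EP_TERMINATED } ep_state;

typedef struct {
    int posted_recvs;        /* RECV buffers currently on the receive queue */
    empty_rq_policy policy;  /* the transport/device dependent penalty */
    ep_state state;
} sim_ep;

/* Post one RECV buffer on the receive queue. */
static void sim_post_recv(sim_ep *ep) {
    ep->posted_recvs++;
}

/* Deliver one inbound SEND to a connected endpoint.
 * Returns true if it landed in a posted buffer. */
static bool sim_deliver_send(sim_ep *ep) {
    assert(ep->state == EP_CONNECTED);
    if (ep->posted_recvs > 0) {
        ep->posted_recvs--;
        return true;
    }
    switch (ep->policy) {
    case POLICY_RETRY:      /* sender is told to retry later; nothing lost yet */
        break;
    case POLICY_FENCE:      /* queue stalls until the ULP recovers */
        ep->state = EP_FENCED;
        break;
    case POLICY_TERMINATE:  /* connection is torn down */
        ep->state = EP_TERMINATED;
        break;
    }
    return false;
}
```

Under POLICY_TERMINATE, a delivery with nothing posted moves the endpoint to EP_TERMINATED, which is the case the sleep() after dat_cr_accept() made reproducible; calling sim_post_recv() first avoids the penalty under every policy.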

I know of no uDAPL restriction saying a recv must be posted before a send.

And yes, we do pre-post recv buffers, but since the BTL creates 2 connections per peer, one for eager-size messages and one for max-size messages, the BTL needs to know which connection the current endpoint is to service so that it can post the proper size recv buffer.
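A minimal sketch of why the connection kind has to be known before anything can be pre-posted: the buffer size depends on which of the two per-peer connections the endpoint services. The enum, the sizes, and the function names are hypothetical placeholders, not the udapl BTL's actual identifiers or limits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: the two connections kept per peer. */
typedef enum { CONN_EAGER, CONN_MAX } conn_kind;

/* Placeholder sizes; the real limits are configured elsewhere. */
#define SKETCH_EAGER_SIZE (8u * 1024u)
#define SKETCH_MAX_SIZE   (64u * 1024u)

static size_t recv_buffer_size(conn_kind kind) {
    return kind == CONN_EAGER ? SKETCH_EAGER_SIZE : SKETCH_MAX_SIZE;
}

typedef struct {
    conn_kind kind;
    size_t buf_size;
    int nbufs;     /* number of buffers successfully allocated */
    void **bufs;   /* buffers that would be handed to the RECV posts */
} prepost_set;

/* Allocate nbufs receive buffers sized for this connection kind.
 * Returns 0 on success, -1 on allocation failure; in either case the
 * caller should eventually call prepost_free(). */
static int prepost_alloc(prepost_set *set, conn_kind kind, int nbufs) {
    assert(nbufs > 0);
    set->kind = kind;
    set->buf_size = recv_buffer_size(kind);
    set->nbufs = 0;
    set->bufs = calloc((size_t)nbufs, sizeof *set->bufs);
    if (set->bufs == NULL)
        return -1;
    for (int i = 0; i < nbufs; i++) {
        set->bufs[i] = malloc(set->buf_size);
        if (set->bufs[i] == NULL)
            return -1;
        set->nbufs++;
    }
    return 0;
}

/* Release every buffer that prepost_alloc() managed to allocate. */
static void prepost_free(prepost_set *set) {
    for (int i = 0; i < set->nbufs; i++)
        free(set->bufs[i]);
    free(set->bufs);
    set->bufs = NULL;
    set->nbufs = 0;
}
```

If the kind of a connection is only settled after DAT_CONNECTION_EVENT_ESTABLISHED, prepost_alloc() cannot run earlier, which is the constraint on moving the post ahead of the connect or accept.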

Also, I agree that in theory the btl could post the recv, which currently occurs in mca_btl_udapl_sendrecv, before the connect or accept, but I think in practice we had issues doing this and had to wait until a DAT_CONNECTION_EVENT_ESTABLISHED was received.

From what I can tell, the udapl btl exchanges memory info as a first
order of business after connection establishment
(mca_btl_udapl_sendrecv()). The RECV buffer post for this
exchange, however, should really be done _before_ the
dat_ep_connect() on the active side, and _before_ the
dat_cr_accept() on the server side.
Currently it's done after the ESTABLISHED event is dequeued,
thus allowing the race condition.
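The reordering proposed in the quoted text can be sketched as a sequence of steps per side. The step functions here are hypothetical stubs that only append to a log; they stand in for the real dat_ep_post_recv(), dat_ep_connect(), and dat_cr_accept() calls, whose signatures are deliberately not reproduced.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical step log standing in for real DAT calls. */
#define MAX_STEPS 8
typedef struct {
    const char *steps[MAX_STEPS];
    int n;
} step_log;

static void record(step_log *log, const char *step) {
    assert(log->n < MAX_STEPS);
    log->steps[log->n++] = step;
}

/* Index of the first occurrence of step in the log, or -1. */
static int step_index(const step_log *log, const char *step) {
    for (int i = 0; i < log->n; i++)
        if (strcmp(log->steps[i], step) == 0)
            return i;
    return -1;
}

/* Active side: the RECV for the memory-info exchange goes up before connect. */
static void active_setup(step_log *log) {
    record(log, "post_recv");     /* stands in for dat_ep_post_recv() */
    record(log, "connect");       /* stands in for dat_ep_connect() */
    record(log, "established");   /* DAT_CONNECTION_EVENT_ESTABLISHED */
    record(log, "send_meminfo");  /* first message after setup */
}

/* Passive side: same rule, the RECV is posted before the accept. */
static void passive_setup(step_log *log) {
    record(log, "post_recv");     /* stands in for dat_ep_post_recv() */
    record(log, "accept");        /* stands in for dat_cr_accept() */
    record(log, "established");
    record(log, "send_meminfo");
}
```

Whether a given provider accepts a RECV post on a not-yet-connected endpoint is exactly where Don reports having had trouble, so this ordering would need to be checked against each provider.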

I believe the rules are the ULP must ensure that a RECV is
posted before the client can post a SEND for that buffer.
And further, the ULP must enforce flow control somehow so
that a SEND never arrives without a RECV buffer being available.
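One common way for a ULP to meet that rule is credit-based flow control: the sender keeps at most as many SENDs outstanding as the receiver has advertised RECV buffers, and the receiver hands credits back as it reposts. A minimal single-process sketch with hypothetical names; it is not how the udapl BTL is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Sender-side view of one connection under credit-based flow control. */
typedef struct {
    int credits;   /* RECV buffers the peer has said are posted */
    int pending;   /* SENDs queued locally, waiting for credit */
} credit_ep;

/* Send now if a credit is available; otherwise queue the SEND.
 * Returns true if the SEND went out on the wire. */
static bool credit_send(credit_ep *ep) {
    if (ep->credits > 0) {
        ep->credits--;
        return true;
    }
    ep->pending++;
    return false;
}

/* The peer reports n reposted RECV buffers (e.g. piggybacked on a reply).
 * Queued SENDs are released first; returns how many were released. */
static int credit_return(credit_ep *ep, int n) {
    assert(n >= 0);
    ep->credits += n;
    int released = 0;
    while (ep->pending > 0 && ep->credits > 0) {
        ep->pending--;
        ep->credits--;
        released++;
    }
    return released;
}
```

The initial credit count has to match the number of RECVs the peer pre-posted before connecting, which is why the pre-post ordering and the flow-control rule go together.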

Maybe this is a rule iWARP imposes on its ULPs but not uDAPL.

Perhaps this is just a bug, and I opened it up with my sleep().

Or is the uDAPL btl assuming the transport will deal with
lack of RECV buffer at the time a SEND arrives?

There may be a race condition here, but you really have to try hard to see it.


From Steve, previously:

"Also: Given there is a message exchange _always_ after connection setup, then we can change that exchange to support the client-must-send-first issue..."

I agree; I am sure we can do something, but if it includes an additional message we should consider an MCA parameter to govern this, because the connection wireup is already costly enough.
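A sketch of how the existing exchange might be reused without adding a message: the active (client) side always sends its memory info first, and the passive side sends its own only after the client's has arrived. The flag standing in for an MCA parameter is hypothetical, as are all other names; with the flag off, both sides simply send as soon as the connection is established.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SIDE_ACTIVE, SIDE_PASSIVE } conn_side;

/* Hypothetical stand-in for an MCA parameter enforcing client-sends-first. */
static bool client_sends_first = true;

/* May this side send its memory-info message now?  peer_msg_received says
 * whether the peer's memory-info message has already been received. */
static bool may_send_meminfo(conn_side side, bool peer_msg_received) {
    if (!client_sends_first)
        return true;               /* both sides send right after setup */
    if (side == SIDE_ACTIVE)
        return true;               /* the client always goes first */
    return peer_msg_received;      /* the server replies only after receiving */
}
```

Each side still sends exactly one memory-info message, so this ordering serialises the two sends (adding some latency to wireup) rather than adding a message.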


-DON


No. uDAPL *allows* a provider to compensate for this through
unspecified means, but the application MUST NOT rely on it
(on the flip side the application MUST NOT rely on any
mistake generating a fault. That's akin to relying on
a state trooper pulling you over when you exceed the
speed limit. It is always possible that your application
has too many buffers in flight but this is never detected
because the new buffers are posted before the messages
actually arrive. You're not supposed to do that, but you
have a good chance of getting away with it).

As a general rule DAPL *never* requires a provider to
check anything that the provider does not need to check
on its own (other than memory access rights). So typically
the provider will complain about too many buffers when it
actually runs out of buffers, not when the application's
end-to-end credits are theoretically negative. A "fast
path" interface becomes a lot less so if every work request
is validated dynamically against every relevant
restriction.
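The point that a provider complains only when it actually runs out of buffers, not when the application's end-to-end credits go negative, can be shown with a small timeline replay. This is an illustrative model with a hypothetical function name, not any provider's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Replay a receiver-side timeline and report whether the provider would
 * ever notice a problem.
 *   'P' = the application posts one RECV buffer
 *   'A' = one SEND arrives from the peer
 * The provider checks only at arrival time: it faults if no RECV is posted
 * at that instant, and never audits end-to-end credits. */
static bool provider_faults(int initially_posted, const char *timeline) {
    int posted = initially_posted;
    for (size_t i = 0; timeline[i] != '\0'; i++) {
        if (timeline[i] == 'P') {
            posted++;
        } else if (timeline[i] == 'A') {
            if (posted == 0)
                return true;   /* actually out of buffers: penalty applies */
            posted--;
        }
    }
    return false;
}
```

With one buffer initially posted, "APA" and "AAP" break the same end-to-end rule (two SENDs against one advertised buffer), but only "AAP" is caught, because in "APA" the repost happens before the second message arrives.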


_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
