On Thu, 2007-05-10 at 23:10 -0400, Donald Kerr wrote:
> 
> Caitlin Bestler wrote:
> 
> >devel-boun...@open-mpi.org wrote:
> >  
> >
> >>>There are two new issues so far:
> >>>
> >>>1) this has uncovered a connection migration issue in the Chelsio
> >>>driver/firmware.  We are developing and testing a fix for this now.
> >>>Should be ready tomorrow hopefully.
> >>>
> >>>      
> >>>
> >>I have a fix for the above issue and I can continue with OMPI testing.
> >>
> >>To work around the client-must-send issue, I put a nice fat
> >>sleep in the udapl btl right after it calls dat_cr_accept(),
> >>in mca_btl_udapl_accept_connect().  This, however, exposes
> >>another issue with the udapl btl:
> >>    
> >>
> sleeping after accept? What are you trying to do here force a race 
> condition?
> 

More like trying to work around the race condition that exists:  The
server side sends an rdma message first thus violating the iwarp
protocol.  For those who want the gory details:  when the server sends
first -and- that rdma message arrives at the client _before_ the client
transitions into rdma mode, then that rdma message gets passed up to the
host driver as streaming mode data.  So when I originally ran OMPI/udapl
on chelsio's rnic, the client actually received the mpa start response
-and- the first fpdu from the server while still in streaming mode.
This results in a connection abort because it violates the mpa startup
protocol...

By causing the server to sleep just after accepting the connection, I
give the client time to transition into rdma mode...

It was just a hack to continue testing OMPI/udapl/chelsio.  And it
revealed the problem being discussed in this thread:  OMPI udapl btl
doesn't pre-post the recvs for the sendrecv exchange.


> >>Neither the client nor the server side of the udapl btl
> >>connection setup pre-post RECV buffers before connecting.
> >>This can allow a SEND to arrive before a RECV buffer is
> >>available.  I _think_ IB will handle this issue by
> >>retransmitting the SEND.  Chelsio's iWARP device, however,
> >>TERMINATEs the connection.  My sleep() makes this condition
> >>happen every time.
> >>
> >>    
> >>
> >
> >A compliant DAPL program also ensures that there are adequate
> >receive buffers in place before the remote peer Sends. It is
> >explicitly noted that failure to follow this real will invoke
> >a transport/device dependent penalty. It may be that the sendq
> >will be fenced, or it may be that the connection will be terminated.
> >
> >So any RDMA BTL should pre-post recv buffers before initiating or
> >accepting a connection.
> >
> >  
> >
> I know of no udapl restiction saying a recv must be posted before a send.
> And yes we do pre post recv buffers but since the BTL creates 2 
> connections per peer, one for eager size messages and one for max size 
> messages the BTL needs to know which connection the current endpoint is 
> to service so that it can post the proper size recv buffer.
> 
> Also, I agree in theory the btl could potentially post the recv which 
> currently occurs in mca_btl_udapl_sendrecv before the connect or accept 
> but I think in practise we had issue doing this and we had to wait until 
> a DAT_CONNECTION_EVENT_ESTABLISHED was received.
> 

Well I'm trying it now.  It should work.  If it doesn't, then dapl or
the underlying providers are broken.

> >
> >  
> >
> >>>From what I can tell, the udapl btl exchanges memory info as a first
> >>>      
> >>>
> >>order of business after connection establishment
> >>(mba_btl_udapl_sendrecv().  The RECV buffer post for this
> >>exchange, however, should really be done _before_ the
> >>dat_ep_connect() on the active side, and _before_ the
> >>dat_cr_accept() on the server side.
> >>Currently its done after the ESTABLISHED event is dequeued,
> >>thus allowing the race condition.
> >>
> >>I believe the rules are the ULP must ensure that a RECV is
> >>posted before the client can post a SEND for that buffer.
> >>And further, the ULP must enforce flow control somehow so
> >>that a SEND never arrives without a RECV buffer being available.
> >>    
> >>
> maybe this is a rule iwarp imposes on its ULPs but not uDAPL.
> 
> >>Perhaps this is just a bug and I opened it up with my sleep()
> >>
> >>Or is the uDAPL btl assuming the transport will deal with
> >>lack of RECV buffer at the time a SEND arrives?
> >>    
> >>
> There may be a race condition here but you really have to try hard to 
> see it.
> 

I agree its a small race condition that I made very large by my
sleep! :-)  But I can kill 2 birds with one stone here:  I'm testing now
a change to the sendrecv exchange to:

1) prepost the recvs just after dat_ep_create

2) force the side that issues the dat_ep_connect to send its addr-data
first

3) force the side that issues the dat_cr_accept to wait for the send
from the peer, then post its addr-data send

This will plug both race conditions.  I'll post the patch once I debug
it and we can discuss if you thinks a good solution and/or if it work
for solaris udapl.

>  From Steve  previously.
> "Also: Given there is a message exchange _always_ after connection 
> setup, then we can change that exchange to support the 
> client-must-send-first issue..."
> 
> I agree I am sure we can do something but if it includes an additional 
> message we should consider a mca parameter to govern this because the 
> connection wireup is already costly enough.
> 

It won't add an additional message.  Stay tuned for a patch!

Steve.

Reply via email to