> 
> There are two new issues so far:
> 
> 1) this has uncovered a connection migration issue in the Chelsio
> driver/firmware.  We are developing and testing a fix for this now.
> Should be ready tomorrow hopefully.
> 

I have a fix for the above issue and I can continue with OMPI testing.

To work around the client-must-send issue, I put a nice fat sleep in the
udapl btl right after it calls dat_cr_accept(), in
mca_btl_udapl_accept_connect().  This, however, exposes another issue
with the udapl btl:

Neither the client nor the server side of the udapl btl connection setup
pre-posts RECV buffers before connecting.  This can allow a SEND to
arrive before a RECV buffer is available.  I _think_ IB will handle this
issue by retransmitting the SEND.  Chelsio's iWARP device, however,
TERMINATEs the connection.  My sleep() makes this condition happen every
time.

From what I can tell, the udapl btl exchanges memory info as a first
order of business after connection establishment
(mca_btl_udapl_sendrecv()).  The RECV buffer post for this exchange,
however, should really be done _before_ the dat_ep_connect() on the
active side, and _before_ the dat_cr_accept() on the server side.
Currently it's done after the ESTABLISHED event is dequeued, thus
allowing the race condition.
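
To make the ordering concrete, here is a toy model of one side of the
connection.  It is only a sketch: the toy_* names are made up for
illustration and are not the uDAPL API or the btl code, and the
"no RECV posted means the connection is terminated" rule models the
iWARP behavior I described above, not IB.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one endpoint; hypothetical, not the uDAPL API. */
typedef struct {
    int  posted_recvs;  /* RECV buffers currently posted */
    bool connected;     /* connect/accept has been issued */
    bool terminated;    /* transport tore the connection down */
} toy_ep;

static void toy_ep_init(toy_ep *ep)
{
    ep->posted_recvs = 0;
    ep->connected = false;
    ep->terminated = false;
}

/* Stands in for posting a RECV buffer (dat_ep_post_recv() in uDAPL). */
static void toy_post_recv(toy_ep *ep)
{
    ep->posted_recvs++;
}

/* Stands in for dat_ep_connect() or dat_cr_accept(). */
static void toy_connect(toy_ep *ep)
{
    ep->connected = true;
}

/* A SEND from the peer arrives.  With no RECV posted the connection is
 * terminated (the assumed iWARP behavior).  Returns true if delivered. */
static bool toy_incoming_send(toy_ep *ep)
{
    assert(ep->connected);
    if (ep->terminated)
        return false;
    if (ep->posted_recvs == 0) {
        ep->terminated = true;
        return false;
    }
    ep->posted_recvs--;
    return true;
}

/* Ordering as it is today: connect first, post the RECV afterwards.
 * The peer's SEND can win the race. */
static bool toy_current_order(void)
{
    toy_ep ep;
    toy_ep_init(&ep);
    toy_connect(&ep);
    bool delivered = toy_incoming_send(&ep); /* arrives before the RECV */
    toy_post_recv(&ep);                      /* too late */
    return delivered;
}

/* Proposed ordering: post the RECV before connect/accept. */
static bool toy_prepost_order(void)
{
    toy_ep ep;
    toy_ep_init(&ep);
    toy_post_recv(&ep);
    toy_connect(&ep);
    return toy_incoming_send(&ep);
}
```

In this model toy_current_order() loses the SEND and toy_prepost_order()
delivers it, which is the whole argument for moving the RECV post ahead
of dat_ep_connect() and dat_cr_accept().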

I believe the rule is that the ULP must ensure a RECV is posted before
the peer can post a SEND for that buffer.  And further, the ULP must
enforce flow control somehow so that a SEND never arrives without a RECV
buffer being available.
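
One common way for a ULP to enforce that is credit-based flow control.
The sketch below is a generic illustration, not what the udapl btl
does: the toy_* names are hypothetical, and it simplifies by letting the
sender learn about new credits instantly, where a real protocol would
carry credit updates in messages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types; not part of uDAPL or the btl. */
typedef struct { int credits; } toy_sender;        /* RECVs known to be posted */
typedef struct { int posted_recvs; } toy_receiver; /* RECVs actually posted */

/* Receiver posts n RECV buffers and advertises them to the sender. */
static void toy_grant(toy_receiver *rx, toy_sender *tx, int n)
{
    assert(n >= 0);
    rx->posted_recvs += n;
    tx->credits += n;
}

/* Sender transmits only while it holds a credit.
 * Returns false when the SEND must be held back locally. */
static bool toy_try_send(toy_sender *tx, toy_receiver *rx)
{
    if (tx->credits == 0)
        return false;
    tx->credits--;
    assert(rx->posted_recvs > 0); /* invariant: credits <= posted RECVs */
    rx->posted_recvs--;
    return true;
}

/* Attempt a burst of SENDs against a fixed initial grant and
 * return how many were actually put on the wire. */
static int toy_send_burst(int initial_credits, int attempts)
{
    toy_sender tx = { 0 };
    toy_receiver rx = { 0 };
    int delivered = 0;

    toy_grant(&rx, &tx, initial_credits);
    for (int i = 0; i < attempts; i++) {
        if (toy_try_send(&tx, &rx))
            delivered++;
    }
    return delivered;
}
```

With two credits and five attempts, only two SENDs go out and the other
three wait at the sender, so a SEND never reaches the receiver without a
RECV buffer behind it.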

Perhaps this is just a bug and I opened it up with my sleep().

Or is the uDAPL btl assuming the transport will deal with lack of RECV
buffer at the time a SEND arrives?

Also: Given there is _always_ a message exchange after connection setup,
we can change that exchange to address the client-must-send-first
issue...


Steve.



