Gleb Natapov wrote:
On Sun, Mar 09, 2008 at 02:48:09PM -0500, Jon Mason wrote:
Issue (as described by Steve Wise):
Currently OMPI uses qp 0 for all credit updates (by design). This breaks
when running over the chelsio rnic due to a race condition between
advertising the availability of a buffer using qp0 when the buffer was
posted on one of the other qps. It is possible (and easily reproducible)
that the peer gets the advertisement and sends data into the qp in question
_before_ the rnic has processed the recv buffer and made it available for
placement. This results in a connection termination. BTW, other HCAs have
this issue too; ehca, for example, reportedly has the same race condition.
I think the timing hole is much smaller, though, for devices that
have 2 separate work queues for the SQ and RQ of a QP. Chelsio has a
single work queue to implement both SQ and RQ, so processing of RQ work
requests gets queued up behind pending SQ entries which can make this race
condition more prevalent.
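(To make the ordering concrete, here is a minimal illustration at the verbs
level. This is not OMPI code; the QPs and work requests are placeholders,
and it only shows why returning from ibv_post_recv() is not a strong enough
guarantee on this hardware:)

    /* Illustration only -- not OMPI code. */
    #include <infiniband/verbs.h>

    static int post_and_advertise(struct ibv_qp *data_qp, struct ibv_qp *qp0,
                                  struct ibv_recv_wr *recv_wr,
                                  struct ibv_send_wr *credit_wr)
    {
        struct ibv_recv_wr *bad_recv;
        struct ibv_send_wr *bad_send;

        /* 1. Queue the receive buffer on the data QP.  Returning here only
         *    means the WR is queued, not that the RNIC has processed it. */
        if (ibv_post_recv(data_qp, recv_wr, &bad_recv))
            return -1;

        /* 2. Advertise the credit on qp0.  If the peer acts on this and
         *    sends into data_qp before the RNIC makes the buffer from step 1
         *    available for placement, the connection is terminated.  Posting
         *    the credit update on data_qp itself closes the window, because
         *    the RQ entry is processed before the later SQ entry on the
         *    same QP. */
        if (ibv_post_send(qp0, credit_wr, &bad_send))
            return -1;

        return 0;
    }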
There was a discussion about this on the OpenFabrics mailing list, and the
conclusion was that what Open MPI does is correct according to the IB/iWARP
specs.
Hey Gleb. Yes, the conclusion was that the RDMA device and driver should
ensure this. But note that the ehca IB device also has this same race
condition, so I wonder whether the other IB devices really have it too. I
think it is worse for the cxgb3 device due to its architecture (a single
queue for both send and recv work requests).
I don't know of any way to avoid this issue other than to ensure that all
credit updates for qp X are posted only on qp X. If we do this, then the
chelsio HW/FW ensures that the RECV is posted before the subsequent send
operation that advertises the buffer is processed.
Is it possible to fix your FW to follow the iWARP spec? Perhaps it is
possible to implement ibv_post_recv() so that it does not return before the
posted receive is processed?
I've been trying to come up with a solution in the lib/driver/fw to enforce
this behavior. The only way I can see to do it is to follow the recv
work requests with a 0B write work request and spin or block
until the 0B write completes (note: 0B write doesn't emit anything on
the wire for the cxgb3 device). This will guarantee that the recvs are
ready before returning from the libcxgb3 post_recv function. However,
this is problematic because there can be real OMPI work completions in
the CQ that need processing. So I don't know how to do this in the
driver/library.
Also note, any such solution will entirely drain the SQ each time a recv
is posted. This will kill performance.
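(For the archives, here is roughly what that 0B-write fence would look like
if it were expressed at the verbs level. The real change would have to live
inside libcxgb3's post_recv path; FENCE_WR_ID and the spin loop are purely
illustrative:)

    /* Sketch only -- not proposed code. */
    #include <string.h>
    #include <infiniband/verbs.h>

    #define FENCE_WR_ID 0xfeedfeedULL   /* made-up sentinel */

    static int post_recv_fenced(struct ibv_qp *qp, struct ibv_cq *send_cq,
                                struct ibv_recv_wr *recv_wr)
    {
        struct ibv_recv_wr *bad_recv;
        struct ibv_send_wr fence, *bad_send;
        struct ibv_wc wc;
        int n;

        if (ibv_post_recv(qp, recv_wr, &bad_recv))
            return -1;

        /* 0-byte RDMA write: emits nothing on the wire on cxgb3, but its
         * completion proves the earlier RQ entries have been processed. */
        memset(&fence, 0, sizeof(fence));
        fence.wr_id      = FENCE_WR_ID;
        fence.opcode     = IBV_WR_RDMA_WRITE;
        fence.send_flags = IBV_SEND_SIGNALED;
        if (ibv_post_send(qp, &fence, &bad_send))
            return -1;

        /* Spin until the fence completes.  This is exactly the problem
         * described above: real OMPI work completions can show up here
         * (this sketch just throws them away), and the SQ gets drained on
         * every recv post, which kills performance. */
        do {
            n = ibv_poll_cq(send_cq, 1, &wc);
        } while (n == 0 || (n > 0 && wc.wr_id != FENCE_WR_ID));

        return n < 0 ? -1 : 0;
    }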
(Just thinking out loud here): the OMPI code could be designed to _not_
assume recvs are posted until the CPC indicates they are ready, i.e., sort
of asynchronous behavior. When the recvs are ready, the CPC could up-call
the btl and then the credits could be updated. This sounds painful though :)
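(If anyone wants to pursue that up-call idea, the contract might be no more
than something like this. All of the names below are hypothetical, just to
show the shape of it:)

    /* Hypothetical "recvs are ready" contract between the CPC and the BTL;
     * none of these names exist today. */
    typedef void (*recv_ready_cb_t)(void *btl_ctx, int qp_index, int num_ready);

    struct cpc_recv_ops {
        /* The BTL asks the CPC to post receives; they may only become
         * usable asynchronously. */
        int (*post_recvs)(void *cpc_ctx, int qp_index, int count);
        /* The CPC invokes this once the device has really made the buffers
         * available; only then would the BTL send the credit update. */
        recv_ready_cb_t recv_ready;
    };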
To address this Jeff Squyres recommends:
1. make an mca parameter that governs this behavior (i.e., whether to send
all flow control messages on QP0 or on their respective QPs)
2. extend the ini file parsing code to accept this parameter as well (need
to add a strcmp or two)
3. extend the ini file to fill in this value for all the NICs listed (to
include yours).
4. extend the logic in the rest of the btl to send the flow control
messages either across qp0 or the respective qp, depending on the value of
the mca param / ini value (see the sketch just below).
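(A rough sketch of what step 4 might boil down to. The names here are made
up; the real flag would come from whatever the mca param / ini key added in
steps 1-3 ends up being called, e.g. a hypothetical "credit_msgs_on_same_qp"
entry in the Chelsio section of the ini file:)

    /* Hypothetical sketch only -- not actual openib BTL code. */
    struct hca_params {
        int credit_msgs_on_same_qp;  /* 0 = current QP0 behavior, 1 = per-QP */
    };

    /* Pick the QP on which to send a flow control/credit message that
     * advertises buffers posted on data QP 'qpn'. */
    static inline int credit_qp_index(const struct hca_params *p, int qpn)
    {
        return p->credit_msgs_on_same_qp ? qpn : 0;
    }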
I am happy to do the work to enable this, but I would like to get
everyone's feed back before I start down this path. Jeff said Gleb did
the work to change openib to behave this way, so any insight would be
helpful.
I personally don't like the idea of adding another layer of complexity to
the openib BTL code just to work around HW that doesn't follow the spec. If
the workaround is simple, that's OK, but in this case it is not so simple
and will add a code path that is rarely tested. A simple workaround for the
problem may be to not configure multiple QPs if the HW has this bug (and we
can extend the ini file to contain this info).
It doesn't sound too complex to implement the above design. In fact,
that's the way this btl used to work, no? There are large customers
requesting OMPI over cxgb3 and we're ready to provide the effort to get
this done. So I request we come to an agreement on how to support this
device efficiently (and for ompi-1.3).
On the single-QP angle, can I just run OMPI specifying only 1 QP, or will
that require coding changes?
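(If the btl_openib_receive_queues mca parameter on the 1.3 trunk accepts a
single per-peer QP spec, I'm picturing something along these lines -- the
numeric fields and the app name are just placeholders, not a recommendation:)

    mpirun -np 2 \
        --mca btl openib,self \
        --mca btl_openib_receive_queues P,65536,256,192,128 \
        ./my_app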
Thanks!
Steve.