It looks to me like ucma_clean_events() calls rdma_destroy_id() / iw_destroy_cm_id() / destroy_cm_id(), which calls the provider reject function. Or NOT! :) There's a comment in the IW_CM_STATE_CONN_RECV case inside destroy_cm_id():
	/*
	 * App called destroy before/without calling accept after
	 * receiving connection request event notification or
	 * returned non zero from the event callback function.
	 * In either case, must tell the provider to reject.
	 */
But I don't see the call to reject the connection... Maybe you could add it and see if it clears up your issue?
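To make the gap concrete, here is a compilable sketch of what the missing call might look like. The struct and enum definitions below are simplified stand-ins for illustration only, not the real ones from the kernel's iwcm.c; the point is just that the IW_CM_STATE_CONN_RECV case should invoke the provider's reject hook, as the existing comment already promises:

```c
#include <stddef.h>

/* Simplified stand-ins for the iw_cm structures (hypothetical, not the
 * real kernel definitions). */
enum iw_cm_state {
	IW_CM_STATE_IDLE,
	IW_CM_STATE_CONN_RECV,	/* passive side: connect request received */
	IW_CM_STATE_ESTABLISHED,
	IW_CM_STATE_DESTROYING,
};

struct iw_cm_id;

struct iw_cm_verbs {
	/* provider hook that tells the hardware to abort/reject the
	 * pending connection and release its resources */
	int (*reject)(struct iw_cm_id *cm_id, const void *pdata,
		      unsigned char pdata_len);
};

struct iw_cm_id {
	enum iw_cm_state state;
	struct iw_cm_verbs *ops;
};

/* Sketch of the fixed destroy path: when the app destroys a cm_id that
 * is still sitting in CONN_RECV (never accepted or rejected), actually
 * call down into the provider to reject before tearing down. */
static void destroy_cm_id(struct iw_cm_id *cm_id)
{
	switch (cm_id->state) {
	case IW_CM_STATE_CONN_RECV:
		/*
		 * App called destroy before/without calling accept after
		 * receiving connection request event notification or
		 * returned non zero from the event callback function.
		 * In either case, must tell the provider to reject.
		 */
		cm_id->ops->reject(cm_id, NULL, 0);	/* the missing call */
		break;
	default:
		break;
	}
	cm_id->state = IW_CM_STATE_DESTROYING;
}
```

With a change along these lines, the provider gets a chance to free its on-card state for the half-open child even when the app never calls accept/reject itself.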
I haven't hit a problem yet; I am looking at what my driver should/should not do...
Doesn't this sound like a problem (namely a provider/card resource leak due to races with listener destruct)?
It does.
But MPA mandates a timeout, so the connections will eventually get aborted by the provider or the peer...
I believe the timeout you are talking about applies to limiting how long it takes (on the responder side) from an incoming SYN to receipt of the complete MPA request. I don't believe there is much logic in having a timeout between the incoming-connect upcall sent by the driver and an eventual accept/reject done by the app, but that's a separate discussion.
My point is the peer will abort the TCP connection if the passive side
never accepts or rejects.
The core problem is this, though. On a listener destruct, the driver can do either of:

a. destroy all children on which an accept/reject has not yet been invoked; the OFA stack must then stop the app from sending an accept/reject down in that case. There is currently an attempt to do this at the ucma layer (e.g. cleaning up unpolled events), but it is not race free.
This code is only cleaning up cm_id's that have _not_ been reaped by the application via get_rdma_cm_event(). Any connection requests that have been reaped will stay around until the application disposes of them via rdma_accept(), rdma_reject(), rdma_destroy_id(), or when the process exits.
b. OFA guarantees that an eventual accept/reject downcall will be made, and the driver can rely on that to prevent resource leakage.
Yes I think the rdma core must guarantee an eventual accept/reject downcall.
Any other solution will have some problem somewhere. E.g., with your timeout suggestion, if the driver goes ahead and cleans up the state and on-card resources for the child, then due to the race mentioned in (a) above, the app might still succeed in making an eventual accept/reject call, leading to a kernel crash.
But I think you've found a bug...
Steve.
Are folks filing bugs in bugzilla or similar?
You can if you want. There is a bugzilla db on the ofa site...
Or provide the fix and test it. That would be ideal...
Steve.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general