It looks to me like ucma_clean_events() calls rdma_destroy_id() / iw_destroy_cm_id() / destroy_cm_id(), which calls the provider reject function. Or NOT! :) There's a comment in the IW_CM_STATE_CONN_RECV case inside destroy_cm_id():
	/*
	 * App called destroy before/without calling accept after
	 * receiving connection request event notification or
	 * returned non zero from the event callback function.
	 * In either case, must tell the provider to reject.
	 */
But I don't see the call to reject the connection... Maybe you could add it and see if it clears up your issue?
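To make the gap concrete, here is a compilable sketch of what the missing call might look like. The struct and enum definitions below are simplified stand-ins for illustration only, not the real ones from the kernel's iwcm.c; the point is just that the IW_CM_STATE_CONN_RECV case should invoke the provider's reject hook, as the existing comment already promises:

```c
#include <stddef.h>

/* Simplified stand-ins for the iw_cm structures (hypothetical, not the
 * real kernel definitions). */
enum iw_cm_state {
	IW_CM_STATE_IDLE,
	IW_CM_STATE_CONN_RECV,	/* passive side: connect request received */
	IW_CM_STATE_ESTABLISHED,
	IW_CM_STATE_DESTROYING,
};

struct iw_cm_id;

struct iw_cm_verbs {
	/* provider hook that tells the hardware to abort/reject the
	 * pending connection and release its resources */
	int (*reject)(struct iw_cm_id *cm_id, const void *pdata,
		      unsigned char pdata_len);
};

struct iw_cm_id {
	enum iw_cm_state state;
	struct iw_cm_verbs *ops;
};

/* Sketch of the fixed destroy path: when the app destroys a cm_id that
 * is still sitting in CONN_RECV (never accepted or rejected), actually
 * call down into the provider to reject before tearing down. */
static void destroy_cm_id(struct iw_cm_id *cm_id)
{
	switch (cm_id->state) {
	case IW_CM_STATE_CONN_RECV:
		/*
		 * App called destroy before/without calling accept after
		 * receiving connection request event notification or
		 * returned non zero from the event callback function.
		 * In either case, must tell the provider to reject.
		 */
		cm_id->ops->reject(cm_id, NULL, 0);	/* the missing call */
		break;
	default:
		break;
	}
	cm_id->state = IW_CM_STATE_DESTROYING;
}
```

With a change along these lines, the provider gets a chance to free its on-card state for the half-open child even when the app never calls accept/reject itself.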
I haven't hit a problem yet; I am looking at what my driver should/should not do...
Doesn't this sound like a problem (namely a provider/card resource leak due to races with listener destruct)?
It does.
But MPA mandates a timeout, so the connections will eventually get aborted by the provider or the peer...
I believe the timeout you are talking about applies to limiting how long it takes (on the responder side) from an incoming SYN to receipt of the complete MPA request. I don't believe there is much logic in having a timeout between the incoming-connect upcall sent by the driver and an eventual accept/reject done by the app, but that's a separate discussion.
My point is the peer will abort the TCP connection if the passive side
never accepts or rejects.
The core problem is this, though. On a listener destruct, the driver can do either of:

a. destroy all children on which an accept/reject has not yet been invoked; the OFA stack must then stop the app from sending an accept/reject down in that case. There is currently an attempt to do this at the ucma layer (e.g. cleaning up unpolled events), but it is not race free.
This code is only cleaning up cm_id's that have _not_ been reaped by the application via get_rdma_cm_event(). Any connection requests that have been reaped will stay around until the application disposes of them via rdma_accept(), rdma_reject(), rdma_destroy_id(), or when the process exits.
b. OFA guarantees that an eventual accept/reject downcall will be made, and the driver can rely on that to prevent resource leakage.
Yes I think the rdma core must guarantee an eventual accept/reject downcall.
Any other solution will have some problem somewhere. E.g., with your timeout suggestion, if the driver goes ahead and cleans up the state and on-card resources for the child, then due to the race mentioned in (a) above, the app might still succeed in making an eventual accept/reject call, leading to a kernel crash.
But I think you've found a bug...
Steve.
Are folks filing bugs in bugzilla or similar?
You can if you want. There is a bugzilla db on the ofa site...
Or provide the fix and test it. That would be ideal...
Steve.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general