My main question has to do with an error path in cm_req_handler. If cm_init_av fails (lines 1098 or 1103), I get the following crash:
Also, this fixes the crash when the failure occurs, but removal of the CM module now hangs.
An easy way to reproduce this is to clear out the path record DGID before sending the REP.
an update...
I've been able to reproduce this, and what's happening is that the cm_id that the CM created to handle the REQ is hanging waiting for its reference count to go to 0, but I'm not entirely sure why yet.
The REQ is received and processed in a CM-controlled work queue. After seeing the error, the CM sends a REJ message to the sender. (The code to set the proper reject code is not there yet, but a REJ should still be delivered.) Sending the REJ increments the reference count on the cm_id. The CM then waits in the CM work queue thread for the send to complete, which would decrement the reference count.
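To make that sequence concrete, here is a rough user-space model of the reference counting as I understand it. It is only a sketch: the struct and every model_* name are made up for illustration and are not the actual CM code, and it assumes the creator holds exactly one reference of its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a cm_id; not the kernel structure. */
struct model_cm_id {
    int refcount;      /* references currently held on the cm_id */
    int sends_pending; /* REJ sends posted but not yet completed */
};

/* The creator starts out holding a single reference (assumption). */
static void model_init(struct model_cm_id *id)
{
    id->refcount = 1;
    id->sends_pending = 0;
}

/* Posting the REJ takes a reference that only the send completion drops. */
static void model_send_rej(struct model_cm_id *id)
{
    id->refcount++;
    id->sends_pending++;
}

/* Send completion, expected to run from the MAD layer's work queue. */
static void model_send_done(struct model_cm_id *id)
{
    assert(id->sends_pending > 0);
    assert(id->refcount > 0);
    id->sends_pending--;
    id->refcount--;
}

/*
 * Destroy drops the creator's reference and then waits for the count
 * to reach 0. This reports whether that wait could finish right now.
 */
static bool model_destroy_can_finish(const struct model_cm_id *id)
{
    return id->refcount - 1 == 0;
}
```

In this model, calling model_send_rej() without a matching model_send_done() leaves model_destroy_can_finish() false forever, which matches the hang: the wait can only end if the send completion actually runs.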
The send completion should be processed from the context of the MAD-layer-controlled work queue, so I'm not sure why it's not getting called. My planned long-term fix is to allow the REJ to be sent without holding a reference on the cm_id. But there's a similar issue when sending a DREQ or DREP while destroying a cm_id, so I'm trying to understand this better first.
- Sean
_______________________________________________
openib-general mailing list
http://openib.org/mailman/listinfo/openib-general
