Re: [openib-general] RFC on CM error handling

Sean Hefty Fri, 21 Jan 2005 14:29:40 -0800

Libor Michalek wrote:

From the REP callback, even if the call to send an RTU is successful, a REJ could still be received. (The remote side timed out waiting for the RTU.) Locally, the cm_id state went from REP_RCVD to ESTABLISHED to TIMEWAIT. Given this, it seems that there are missing state transitions in the spec handling a REJ from REP_RCVD or MRA_REP_SENT states, which would drive the state back to IDLE.
  I think this state transition is ignored, since data transfer will
detect the situation. After the RTU is sent, the connection is
transferred to ESTABLISHED, and the QP is transitioned to RTS, a posted
send will result in an error completion, since the remote QP has been
destroyed and will either not ack or will nack the data. Applications
that care about detecting that a connection, which is not transferring
data, is healthy should perform zero byte RDMA writes...

I agree that the CM could ignore this transition. The CM can probably ignore all REJ messages and rely on timeouts (which is why I haven't coded that portion yet...). Long term I think the CM should attempt to handle REJ in all valid states. From the CM's perspective, the required effort appears to be adding another case in a switch statement.
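To illustrate the "another case in a switch statement" point, here is a minimal sketch of REJ handling, assuming a simplified state enum. The type, constant, and function names are hypothetical and are not taken from the CM source; only the REP_RCVD and MRA_REP_SENT cases come from the discussion above, and every other state is left untouched so that the timeout path still applies.

```c
#include <stdbool.h>

/* Hypothetical subset of CM states; names follow this thread,
 * not any real header. */
enum rej_sketch_state {
    SK_IDLE,
    SK_REP_RCVD,
    SK_MRA_REP_SENT,
    SK_ESTABLISHED,
    SK_TIMEWAIT
};

/*
 * Apply a received REJ to *state.
 * Returns true if the REJ was handled and the state changed,
 * false if the REJ was ignored (the caller would then rely on
 * timeouts, as discussed above).
 */
static bool sketch_handle_rej(enum rej_sketch_state *state)
{
    switch (*state) {
    case SK_REP_RCVD:
    case SK_MRA_REP_SENT:
        /* The transitions missing from the spec: drive back to IDLE. */
        *state = SK_IDLE;
        return true;
    default:
        /* Not covered by this sketch: leave the state alone. */
        return false;
    }
}
```

Covering a further state would mean adding one more case label, which is the low effort referred to above.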

Along these same lines, there are a few more missing state transitions from the spec. A client can receive a DREQ from the REP_SENT state, receive a REP from DREQ_SENT, and receive a DREP from DREQ_RCVD. The CM will handle these by going from REP_SENT to DREQ_RCVD, resending the DREQ, or transitioning to TIMEWAIT, respectively.
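The three extra transitions listed above could be sketched as follows. This is only an illustration under assumed names: the enums and the cm_sk_handle function are hypothetical, and "resend the DREQ" is represented by a return value rather than an actual send.

```c
/* Hypothetical states, messages, and actions; names follow this
 * thread, not any real CM header. */
enum cm_sk_state {
    CM_SK_REP_SENT,
    CM_SK_DREQ_SENT,
    CM_SK_DREQ_RCVD,
    CM_SK_TIMEWAIT
};

enum cm_sk_msg {
    CM_SK_MSG_REP,
    CM_SK_MSG_DREQ,
    CM_SK_MSG_DREP
};

enum cm_sk_action {
    CM_SK_ACT_NONE,          /* message not covered by this sketch */
    CM_SK_ACT_STATE_CHANGED, /* *state was updated */
    CM_SK_ACT_RESEND_DREQ    /* caller should resend the DREQ */
};

/* Handle one received message in the current state. */
static enum cm_sk_action cm_sk_handle(enum cm_sk_state *state,
                                      enum cm_sk_msg msg)
{
    switch (*state) {
    case CM_SK_REP_SENT:
        if (msg == CM_SK_MSG_DREQ) {
            /* DREQ while waiting for the RTU: go to DREQ_RCVD. */
            *state = CM_SK_DREQ_RCVD;
            return CM_SK_ACT_STATE_CHANGED;
        }
        break;
    case CM_SK_DREQ_SENT:
        if (msg == CM_SK_MSG_REP) {
            /* REP after our DREQ: stay put and resend the DREQ. */
            return CM_SK_ACT_RESEND_DREQ;
        }
        break;
    case CM_SK_DREQ_RCVD:
        if (msg == CM_SK_MSG_DREP) {
            /* DREP while in DREQ_RCVD: move to TIMEWAIT. */
            *state = CM_SK_TIMEWAIT;
            return CM_SK_ACT_STATE_CHANGED;
        }
        break;
    default:
        break;
    }
    return CM_SK_ACT_NONE;
}
```

Any state and message pair outside these three returns CM_SK_ACT_NONE and leaves the state unchanged.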

- Sean
