Hello, I am using rdmacm as my connection manager, and I have N processes trying to establish a complete mesh of connections over the RC protocol.
I am receiving a Connection Rejected event with error 10. As per the IBTA spec, the connection server is rejecting the connection requests because it thinks they belong to a stale connection.

I read an earlier response from Sean stating that it is advisable to recreate the QP and retry the connection. That is exactly what I am doing. Every time I get error 10, I destroy the cmid and the attached QP, and go through the entire connection management sequence again (create a new cmid, resolve addr, resolve route, connect). The reconnect attempt succeeds through the route-resolve step, at which point I create a new QP and issue the connect. I am stuck in this loop, where I keep getting rejections on repeated attempts. It almost seems as if the underlying QP number is being recycled and the remote connection server keeps rejecting the connection requests.

As per the IBTA spec:

"A CM may receive a REQ/REP specifying a remote QPN in "REQ:local QPN"/"REP:local QPN" that the CM already considers connected to a local QP. A local CM may receive such a REQ/REP if its local QP has a stale connection, as described in section 12.4.1. When a CM receives such a REQ/REP it shall abort the connection establishment by issuing REJ to the REQ/REP. It shall then issue DREQ, with "DREQ:remote QPN" set to the remote QPN from the REQ/REP, until DREP is received or Max Retries is exceeded, and place the local QP in the TimeWait state."

So the connection server should issue a disconnect request, but the connecting client does not get any disconnect requests. Should librdmacm be sending out this disconnect? Could this be a bug in librdmacm?
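For reference, the decision made on each rejection can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual application code: `on_rejected`, `enum next_step`, the `REJ_STALE_CONN` label and the `max_attempts` cap are hypothetical names, and the sketch assumes the "error 10" arrives as the status of the rejected event. The librdmacm calls are named only in comments.

```c
#include <assert.h>

/* Local label for the reject reason seen in the event ("error 10",
 * stale connection). The name is hypothetical; only the value comes
 * from the report above. */
enum { REJ_STALE_CONN = 10 };

/* What the caller should do after a rejected connect attempt. */
enum next_step {
    STEP_GIVE_UP,            /* report the failure to the application */
    STEP_RECREATE_AND_RETRY  /* tear down and redo connection setup   */
};

/*
 * Hypothetical helper: decide what to do with a rejection.
 *   status       - reject reason taken from the rejected event
 *   attempt      - number of retries already made (0 on first reject)
 *   max_attempts - assumed cap so the loop cannot spin forever
 */
static enum next_step on_rejected(int status, int attempt, int max_attempts)
{
    assert(attempt >= 0);

    if (status != REJ_STALE_CONN)
        return STEP_GIVE_UP;          /* not the stale-connection case */
    if (attempt >= max_attempts)
        return STEP_GIVE_UP;          /* bounded retry */

    /*
     * The caller would now repeat the sequence described above:
     *   rdma_destroy_qp(), rdma_destroy_id(),
     *   rdma_create_id(), rdma_resolve_addr(), rdma_resolve_route(),
     *   rdma_create_qp(), rdma_connect()
     */
    return STEP_RECREATE_AND_RETRY;
}
```

The cap on attempts is an addition for the sketch; the loop described above currently retries without a limit.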
