Hello,
I am using rdmacm as my connection manager, and I have N processes trying to
establish a complete mesh of connections over the RC protocol.

I am receiving a connection rejected event with error 10. Per the IBTA
specification, the connection server is rejecting the connection requests
because it considers them to belong to a stale connection.
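
For reference, a minimal sketch of how the reject status can be turned into
a readable name. It assumes the reason numbering used by the Linux kernel's
enum ib_cm_rej_reason, which follows the IBTA REJ reason table. Only a few
codes are listed, and the enum and helper names here are hypothetical, not
part of librdmacm.

```c
/* Subset of CM REJ reason codes (assumed numbering: IBTA REJ reason
 * table, as mirrored by the kernel's enum ib_cm_rej_reason). */
enum rej_reason {
    REJ_NO_QP            = 1,
    REJ_TIMEOUT          = 4,
    REJ_STALE_CONN       = 10,
    REJ_CONSUMER_DEFINED = 28
};

/* Hypothetical helper: map the status carried by a rejected event
 * to a short description. Unlisted codes fall through to "other". */
static const char *rej_reason_name(int status)
{
    switch (status) {
    case REJ_NO_QP:            return "no QP available";
    case REJ_TIMEOUT:          return "timeout";
    case REJ_STALE_CONN:       return "stale connection";
    case REJ_CONSUMER_DEFINED: return "consumer defined";
    default:                   return "other";
    }
}
```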

I read an earlier response from Sean stating that it is advisable to
recreate the QP and retry the connection. That is exactly what I am doing.
Every time I get error 10, I destroy the cmid and the attached QP and go
through the entire connection management sequence again (create a new cmid,
resolve address, resolve route, connect).
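
The retry sequence described above can be sketched as the following control
flow. This is only a model under stated assumptions: fake_connect and
fake_teardown are hypothetical stand-ins (in a real client they would wrap
rdma_create_id, rdma_resolve_addr, rdma_resolve_route, rdma_create_qp and
rdma_connect, and rdma_destroy_qp and rdma_destroy_id respectively), and the
bound on the number of attempts is an addition for illustration.

```c
#define REJ_STALE_CONN 10  /* CM REJ reason 10, "stale connection" */

/* Hypothetical stand-in for the remote side: it rejects the first
 * few attempts as stale, then accepts. */
struct fake_peer {
    int stale_rejects_left;
    int attempts;   /* connect attempts seen so far */
    int teardowns;  /* cmid/QP teardowns performed */
};

/* Stand-in for: create cmid, resolve addr, resolve route, create QP,
 * connect. Returns 0 on success or the reject reason. */
static int fake_connect(struct fake_peer *p)
{
    p->attempts++;
    if (p->stale_rejects_left > 0) {
        p->stale_rejects_left--;
        return REJ_STALE_CONN;
    }
    return 0;
}

/* Stand-in for destroying the QP and then the cmid. */
static void fake_teardown(struct fake_peer *p)
{
    p->teardowns++;
}

/* Retry while the peer answers "stale connection", up to max_attempts.
 * Returns 0 on success, otherwise the last non-zero status. */
static int connect_with_retry(struct fake_peer *p, int max_attempts)
{
    int status = -1;  /* returned unchanged if max_attempts <= 0 */

    for (int i = 0; i < max_attempts; i++) {
        status = fake_connect(p);
        if (status == 0)
            return 0;
        fake_teardown(p);          /* always rebuild from scratch */
        if (status != REJ_STALE_CONN)
            break;                 /* other rejects are not retried */
    }
    return status;
}
```

With a peer that rejects twice, connect_with_retry succeeds on the third
attempt. With a peer that keeps rejecting, it gives up after max_attempts
instead of looping forever.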

The reconnect attempt succeeds through the route resolve step, at which
point I create a new QP and issue the connect.
I am stuck in a loop where I keep getting disconnects on repeated attempts.
It almost seems as if the underlying QP number is being recycled and
the remote connection server keeps rejecting the connection requests.

As per the IBTA:
"A CM may receive a REQ/REP specifying a remote QPN in
“REQ:local QPN”/“REP:local QPN” that the CM already considers connected
to a local QP. A local CM may receive such a REQ/REP if its local
QP has a stale connection, as described in section 12.4.1. When a CM
receives such a REQ/REP it shall abort the connection establishment by
issuing REJ to the REQ/REP. It shall then issue DREQ, with “DREQ:remote
QPN” set to the remote QPN from the REQ/REP, until DREP is received
or Max Retries is exceeded, and place the local QP in the TimeWait state"
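
My reading of the quoted passage, as a small self-contained model rather
than anything taken from librdmacm or the kernel CM. All type and function
names are hypothetical. The parameter max_dreq simplifies "Max Retries" to
a cap on the total number of DREQs sent, and drep_after simulates the peer
answering with DREP after that many DREQs (0 means it never answers).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum qp_state { QP_CONNECTED, QP_TIMEWAIT };

/* One local QP and the remote QPN the CM believes it is connected to. */
struct local_conn {
    uint32_t      remote_qpn;
    enum qp_state state;
};

struct stale_result {
    bool rej_sent;   /* REJ issued in response to the REQ/REP */
    int  dreq_sent;  /* number of DREQs issued afterwards */
};

/* Handle an incoming REQ whose "REQ:local QPN" is req_local_qpn. */
static struct stale_result handle_req(struct local_conn *conns, size_t n,
                                      uint32_t req_local_qpn,
                                      int max_dreq, int drep_after)
{
    struct stale_result r = { false, 0 };

    for (size_t i = 0; i < n; i++) {
        struct local_conn *c = &conns[i];

        if (c->state != QP_CONNECTED || c->remote_qpn != req_local_qpn)
            continue;

        /* The remote QPN is already considered connected: stale. */
        r.rej_sent = true;

        /* Issue DREQ until DREP arrives or the cap is reached. */
        while (r.dreq_sent < max_dreq) {
            r.dreq_sent++;
            if (drep_after > 0 && r.dreq_sent >= drep_after)
                break;
        }

        /* In both cases the local QP ends up in TimeWait. */
        c->state = QP_TIMEWAIT;
        break;
    }
    return r;
}
```

In this model the DREQ is sent by the side that detected the stale
connection, which is why I would expect the connecting client to see one.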

So the connection server should issue a disconnect request, but the
connecting client does not receive any disconnect requests. Should librdmacm
be sending out this disconnect? Could this be a bug in librdmacm?