I've been seeing some stale connection collisions as a result of one of my test hosts being rebooted much more frequently than the other.
Tests to handle stale connections are not in the current code. (There are some commented-out portions of it, but the checks aren't where they need to be.) My plan is to add this when adding in timewait checking.
Specifically, one of my nodes had two connections with the same remote communications ID and different local communications IDs. When the remote node received a DREQ from this node, a DREQ_RCVD was generated for the given local ID without checking whether the remote ID matched, which it didn't. Since the remote node had come back from a fresh reboot in both cases when it generated the local ID, the local QPN was the same as well.
Currently the dreq_handler checks only the DREQ:remote_comm_id and remote_qpn. Since you have the same QPN, you're hitting this issue. If the stale connection tests mentioned above were finished, this second connection wouldn't have occurred.
I think that all applicable messages should check both IDs.
This isn't overly difficult to add. My thinking on the CM implementation was to treat the remote ID as opaque, so that the local CM didn't need to make any assumptions about how the remote IDs were assigned or used. I'll add in checks against the remote ID (and reject if invalid).
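A minimal sketch of what checking both IDs on a received DREQ could look like. This is not the actual CM code: `struct conn`, `struct dreq_msg`, `check_dreq`, and the result codes are hypothetical names made up for illustration, and the sketch assumes the DREQ's remote_qpn names the receiver's QP. The remote ID is still treated as opaque, since it is only compared for equality.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified connection record (not the real CM structures). */
struct conn {
    uint32_t local_comm_id;   /* ID assigned by this node */
    uint32_t remote_comm_id;  /* opaque ID assigned by the peer */
    uint32_t local_qpn;       /* QP number used by this node */
};

/* Hypothetical view of the DREQ fields, named from the sender's side. */
struct dreq_msg {
    uint32_t local_comm_id;   /* sender's ID: must match our remote_comm_id */
    uint32_t remote_comm_id;  /* our ID: used to look up the connection */
    uint32_t remote_qpn;      /* assumed to be the receiver's (our) QPN */
};

enum dreq_result {
    DREQ_OK,            /* all checks passed, DREQ_RCVD may be generated */
    DREQ_NO_CONN,       /* no connection with that local ID */
    DREQ_BAD_QPN,       /* QPN does not match the connection */
    DREQ_BAD_REMOTE_ID  /* peer's ID does not match: stale connection */
};

/* Look up the connection by our local ID, then verify QPN and remote ID. */
enum dreq_result check_dreq(const struct conn *table, size_t n,
                            const struct dreq_msg *msg)
{
    for (size_t i = 0; i < n; i++) {
        const struct conn *c = &table[i];

        if (c->local_comm_id != msg->remote_comm_id)
            continue;
        if (c->local_qpn != msg->remote_qpn)
            return DREQ_BAD_QPN;
        /* The extra check: equality only, no assumptions about how
         * the peer assigns its IDs. */
        if (c->remote_comm_id != msg->local_comm_id)
            return DREQ_BAD_REMOTE_ID;
        return DREQ_OK;
    }
    return DREQ_NO_CONN;
}
```

With a check like this, a DREQ for the stale connection in the scenario above would carry a matching local ID and QPN but a different sender ID, so it would be rejected instead of generating a DREQ_RCVD on the newer connection.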
