Libor Michalek wrote:
  I've been seeing some stale connection collisions, as a result of one
of my test hosts being rebooted much more frequently than the other.

Tests to handle stale connections are not in the current code. (There are some commented-out portions of it, but the checks aren't where they need to be.) My plan is to add these tests when adding timewait checking.


  Specifically, one of my nodes had two connections with the same remote
communications ID and different local communications IDs. When the remote
node received a DREQ from this node, a DREQ_RCVD was generated for the
given local ID without checking to see if the remote ID matched, which
it didn't. Since the remote node was back from a fresh reboot in both
cases that generated the local ID, the local QPN was the same as well.

Currently, the dreq_handler checks the DREQ:remote_comm_id and remote_qpn. Since you have the same QPN, you're hitting this issue. If the stale connection tests mentioned above had been in place, the second connection wouldn't have been established.


I think that all applicable messages should check both IDs.

This isn't overly difficult to add. My thinking on the CM implementation was to treat the remote ID as opaque, so that the local CM didn't need to make any assumptions about how the remote IDs were assigned or used. I'll add in checks against the remote ID (and reject if invalid).
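
As a rough sketch of what checking both IDs on a DREQ could look like (this is not the actual CM code; struct conn, struct dreq_msg, dreq_check and the field layout are hypothetical names made up for illustration, and the DREQ fields are assumed to be expressed from the sender's point of view):

```c
#include <stdint.h>

/* Hypothetical per-connection state on the receiving node. */
struct conn {
    uint32_t local_comm_id;   /* ID this node assigned to the connection */
    uint32_t remote_comm_id;  /* ID the peer assigned; treated as opaque */
    uint32_t local_qpn;       /* QPN this node uses for the connection */
};

/* Hypothetical view of the DREQ fields, from the sender's perspective. */
struct dreq_msg {
    uint32_t local_comm_id;   /* sender's own ID */
    uint32_t remote_comm_id;  /* ID the sender believes the receiver assigned */
    uint32_t remote_qpn;      /* QPN the sender believes the receiver uses */
};

enum dreq_result { DREQ_ACCEPT, DREQ_REJECT };

/*
 * Accept the DREQ only if it names this connection from both sides.
 * The first two comparisons correspond to the checks described above
 * (remote_comm_id and remote_qpn); the third is the added check
 * against the peer's ID.
 */
static enum dreq_result dreq_check(const struct conn *c,
                                   const struct dreq_msg *m)
{
    if (m->remote_comm_id != c->local_comm_id)
        return DREQ_REJECT;
    if (m->remote_qpn != c->local_qpn)
        return DREQ_REJECT;
    /* Compared for equality only, so the peer's ID stays opaque. */
    if (m->local_comm_id != c->remote_comm_id)
        return DREQ_REJECT;
    return DREQ_ACCEPT;
}
```

With this shape, a DREQ for a stale connection that happens to reuse the same local ID and QPN after a reboot is rejected, because the sender's ID does not match the one recorded for the current connection.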
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

