Ira Weiny wrote:
The symptom is that nodes drop out of the IPoIB mcast group after a node
temporarily goes catatonic.  The details are:

   1) Issues on a node cause a soft lockup of the node.
   2) OpenSM does a normal light sweep.
   3) MADs to the node time out since the node is in a "bad state"
   4) OpenSM marks the node down and drops it from internal tables, including
      mcast groups.
   5) Node recovers from soft lock up condition.
   6) A subsequent sweep causes OpenSM to see the node and add it back to the
      fabric.
As Hal noted, client reregister is the way to go.

In a similar discussion in the past, the conclusion was that the SM should set the client reregister bit in this case (maybe this is even required by the spec, but common sense is a good enough reason, I think), so that IPoIB rejoins and we are done. At the time, I understood that OpenSM would do so (http://lists.openfabrics.org/pipermail/general/2007-September/041237.html). Am I wrong? Or was the case brought up on that thread different (a switch or port going down, so that a whole sub-fabric is removed from the SM's point of view while the links remain up from the nodes' point of view)? The basic point is the case where a node's link stays UP, the SM loses the node for some time, and then sees it again. We used to call this the "active/active" transition, and an SM may need special logic for it.
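
To make the active/active case concrete, here is a toy model of the sequence Ira describes, written in Python. It is only a sketch under my own assumptions and is not OpenSM code: the names Port, ToySM, sweep and join are hypothetical, and the "reregister" list simply stands in for the SM setting the client reregister bit on the returning port.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    guid: int
    link_active: bool = True    # link state as seen on the wire
    responsive: bool = True     # whether the node answers MADs
    mcast_joined: bool = False  # the node's own view of its IPoIB membership


@dataclass
class ToySM:
    known: set = field(default_factory=set)          # ports in the SM tables
    dropped: set = field(default_factory=set)        # lost while link stayed up
    mcast_members: set = field(default_factory=set)  # SM view of the mcast group

    def join(self, port):
        # Stands in for the node (re)joining the IPoIB mcast group.
        self.mcast_members.add(port.guid)
        port.mcast_joined = True

    def sweep(self, ports):
        """Return GUIDs that should get the client reregister bit."""
        reregister = []
        for p in ports:
            if not p.responsive:
                # MADs time out: drop the port from all internal tables.
                if p.guid in self.known:
                    self.known.discard(p.guid)
                    self.mcast_members.discard(p.guid)
                    if p.link_active:
                        self.dropped.add(p.guid)
                continue
            if p.guid not in self.known:
                self.known.add(p.guid)
                # Active/active transition: the link never went down, but
                # the SM lost the port for a while and now sees it again.
                if p.guid in self.dropped:
                    self.dropped.discard(p.guid)
                    reregister.append(p.guid)
        return reregister
```

Walking one node through the steps: after the sweep in which it is unresponsive, the node still has mcast_joined set while the SM no longer lists it in mcast_members, which is the symptom reported. The next sweep after recovery returns the node's GUID, and calling join() for it (as IPoIB would on seeing the reregister bit) brings the two views back in sync.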

Or.

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general