Ira Weiny wrote:
The symptom is that nodes drop out of the IPoIB mcast group after a node
temporarily goes catatonic. The details are:
1) Issues on a node cause a soft lockup of the node.
2) OpenSM does a normal light sweep.
3) MADs to the node time out since the node is in a "bad state".
4) OpenSM marks the node down and drops it from internal tables, including
mcast groups.
5) Node recovers from the soft lockup condition.
6) A subsequent sweep causes OpenSM to see the node and add it back to the
fabric.
As Hal noted, client reregister is the way to go.
In a similar discussion in the past, the conclusion was that the SM
should set the re-register bit in this case (maybe even according to the
spec, but common sense is good enough as well, I think), so that IPoIB
rejoins and we are done. At the time, I understood that OpenSM would do
so
(http://lists.openfabrics.org/pipermail/general/2007-September/041237.html).
Am I wrong, or was the case raised on that thread different (a
switch/port going down, so that a whole sub-fabric is removed from the
SM's point of view while the links remain up from the nodes' point of
view)? The basic point is the case where a node's link is UP, the SM
lost this node for some time, and now sees it again. We used to call it
"the active/active" transition, and an SM may need special logic for it.
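To make the active/active case concrete, here is a minimal toy model in
Python. It is only a sketch under stated assumptions: ToySM, Node, sweep
and run_scenario are hypothetical names made up for illustration, not
OpenSM code or any real API. It assumes the SM drops an unresponsive
node from all of its tables, and that a node which sees the re-register
bit rejoins every group it believes it is a member of.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical end node; not a real IPoIB or OpenSM structure."""
    name: str
    link_active: bool = True   # port state as seen by the node itself
    responsive: bool = True    # whether the node answers MADs
    joined_groups: set = field(default_factory=set)  # node's own view

    def handle_port_info_set(self, client_reregister, sm):
        # Assumption: on the re-register bit, the node rejoins every
        # group it believes it is a member of.
        if client_reregister:
            for group in self.joined_groups:
                sm.join(self.name, group)


class ToySM:
    """Hypothetical subnet manager holding only the tables we care about."""

    def __init__(self):
        self.known = set()   # nodes currently in the SM's tables
        self.mcast = {}      # group name -> set of member node names

    def join(self, node_name, group):
        self.mcast.setdefault(group, set()).add(node_name)

    def sweep(self, nodes, reregister_on_rediscovery):
        for node in nodes:
            if not node.responsive:
                # MADs time out: drop the node, including mcast membership.
                self.known.discard(node.name)
                for members in self.mcast.values():
                    members.discard(node.name)
                continue
            newly_seen = node.name not in self.known
            self.known.add(node.name)
            # Active/active transition: the link never went down on the
            # node side, but the SM had lost the node and sees it again.
            if newly_seen and node.link_active:
                node.handle_port_info_set(reregister_on_rediscovery, self)


def run_scenario(reregister_on_rediscovery):
    """Replay steps 1-6; return True if the node ends up in the group."""
    sm = ToySM()
    node = Node("n1")
    sm.sweep([node], reregister_on_rediscovery)   # initial discovery
    node.joined_groups.add("ipoib-broadcast")
    sm.join(node.name, "ipoib-broadcast")
    node.responsive = False                       # soft lockup
    sm.sweep([node], reregister_on_rediscovery)   # node is dropped
    node.responsive = True                        # node recovers
    sm.sweep([node], reregister_on_rediscovery)   # node is seen again
    return node.name in sm.mcast.get("ipoib-broadcast", set())
```

Calling run_scenario(False) models the reported symptom: the node still
thinks it is joined, but the SM no longer lists it as a member. Calling
run_scenario(True) models the SM setting the re-register bit when it
sees the node again, after which the node rejoins.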
Or.