On Thu, 24 Apr 2008 16:52:07 +0300 Or Gerlitz <[EMAIL PROTECTED]> wrote:
> Ira Weiny wrote:
> > The symptom is that nodes drop out of the IPoIB mcast group after a node
> > temporarily goes catatonic.  The details are:
> >
> >    1) Issues on a node cause a soft lockup of the node.
> >    2) OpenSM does a normal light sweep.
> >    3) MADs to the node time out since the node is in a "bad state"
> >    4) OpenSM marks the node down and drops it from internal tables,
> >       including mcast groups.
> >    5) Node recovers from soft lock up condition.
> >    6) A subsequent sweep causes OpenSM see the node and add it back to the
> >       fabric.
> As Hal noted, client reregister is the way to go.
>
> In a similar discussion in the past the conclusion was that the SM
> should (maybe even according to the spec, but according to common sense
> is fine as well, I think) set the re-register bit where in that case
> IPoIB rejoins and we are done. At the time, I understood that openSM
> would do so
> (http://lists.openfabrics.org/pipermail/general/2007-September/041237.html),
> am I wrong, or maybe the case brought on that thread (switch/port going
> down and a whole sub fabric is removed from the SM point of view where
> the links remain up from the view point of the nodes) was different? the
> basic point is a case where a node link is UP and the SM lost this node
> for some time and now sees it again. We used to call it "the
> active/active" transition and an SM maybe need special logic for it.
>

I have set up the following as a test situation

            switch B
           /        \ (link X)
     switch A      switch C
        /           /    \
     Node1       node2   node3
     (SM)

When I down link X and re-enable it node 2 and 3 do _not_ rejoin the mcast
group.  Debug output from OpenSM indicates it is setting the rereg bit but I
don't see the rejoin in the debug output from the node 2's IPoIB mcast layer.

Perhaps there is a bug to be squashed here?

Just in case anyone is curious, this is with OFED 1.2.5 on a RHEL 5.1 based
kernel, and OpenSM 3.2.1-8341058-dirty.

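
For anyone trying to follow the sequence, here is a toy model of steps 1-6
and of what the rereg bit is supposed to achieve.  It is only a sketch in
plain Python: ToySM, ToyNode, run_scenario and the honors_rereg flag are
made-up names for illustration and have nothing to do with the actual OpenSM
or IPoIB code.

```python
# Toy model only: all names here are hypothetical, not OpenSM or IPoIB APIs.

class ToySM:
    """Subnet manager that tracks known nodes and mcast group members."""

    def __init__(self):
        self.known = set()
        self.mcast_members = set()

    def join(self, node):
        self.known.add(node.name)
        self.mcast_members.add(node.name)

    def sweep(self, nodes, set_rereg=True):
        for node in nodes:
            if not node.responsive:
                # Steps 3-4: MADs time out, node is dropped from all tables.
                self.known.discard(node.name)
                self.mcast_members.discard(node.name)
            elif node.name not in self.known:
                # Step 6: node is seen again and added back to the fabric.
                self.known.add(node.name)
                if set_rereg:
                    node.on_client_reregister(self)


class ToyNode:
    def __init__(self, name, honors_rereg=True):
        self.name = name
        self.responsive = True
        self.honors_rereg = honors_rereg

    def on_client_reregister(self, sm):
        # A working client rejoins its mcast groups when asked to reregister.
        if self.honors_rereg:
            sm.join(self)


def run_scenario(set_rereg, honors_rereg):
    """Return True if the node is back in the mcast group at the end."""
    sm = ToySM()
    node = ToyNode("node2", honors_rereg=honors_rereg)
    sm.join(node)
    node.responsive = False      # step 1: soft lockup
    sm.sweep([node], set_rereg)  # steps 2-4: sweep, timeout, drop
    node.responsive = True       # step 5: node recovers
    sm.sweep([node], set_rereg)  # step 6: node added back
    return node.name in sm.mcast_members
```

In this model the node only ends up back in the group when the SM sets the
bit and the node acts on it.  run_scenario(True, False) is the case that
resembles what I am seeing: the bit is set but no rejoin follows.
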
I am in the process of tracking this down,
Ira
