On Sun, 27 Apr 2008 17:11:40 +0000 Sasha Khapyorsky <[EMAIL PROTECTED]> wrote:
> Hi Ira,
>
> On 13:38 Wed 23 Apr, Ira Weiny wrote:
> >
> > The symptom is that nodes drop out of the IPoIB mcast group after a node
> > temporarily goes catatonic.  The details are:
> >
> >    1) Issues on a node cause a soft lockup of the node.
> >    2) OpenSM does a normal light sweep.
> >    3) MADs to the node time out since the node is in a "bad state"
>
> Normally during light sweep OpenSM will not query nodes. I think OpenSM
> should not detect such soft lockup unless ib link state was changed and
> heavy sweep was triggered. Is this the case?

Yes I agree.  Per my previous mail to Or I found that light sweeps did not in
fact notice the nodes were gone.  Looking at the logs I am not sure what caused
OpenSM to notice them.  However, something must have triggered a heavy sweep
when those nodes were catatonic.  From the logs they were unresponsive for
multiple seconds, some as long as 30s.  It is still a bit of a mystery why
OpenSM did a heavy sweep during this period but I don't think it is
unreasonable for it to do so.

>
> >    4) OpenSM marks the node down and drops it from internal tables,
> >       including mcast groups.
> >    5) Node recovers from soft lock up condition.
> >    6) A subsequent sweep causes OpenSM see the node and add it back to the
> >       fabric.
> >    7) Node is fully functional on the verbs layer but IPoIB never knew
> >       anything was wrong so it does _not_ rejoin the mcast groups.  (This
> >       is different from the condition where the link actually goes down.)
>
> If my approach above is correct it should be same as port down/up
> handling. And as was noted already in this thread OpenSM should ask
> for reregistration (by setting client reregistration bit).
>
> I see your patch - seems this part is buggy in OpenSM now, will see
> closer to this.
>

Yes I believe this is all fixed.
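
For anyone following the thread, the sequence in steps 1-7 can be sketched as
a toy model.  This is only an illustration of the reasoning, not OpenSM or
IPoIB code: `ToySM`, `heavy_sweep`, `simulate` and the `request_reregistration`
flag are hypothetical names standing in for the SM's tables and for the client
reregistration bit mentioned above.

```python
# Toy model of the scenario in this thread. All names are hypothetical;
# nothing here is an OpenSM or IPoIB API.

class ToySM:
    """Minimal stand-in for the subnet manager's internal tables."""

    def __init__(self):
        self.known_ports = set()     # ports the SM currently believes exist
        self.mcast_members = set()   # ports joined to the one mcast group

    def join(self, port):
        """A port joins the mcast group (as IPoIB does at startup)."""
        self.known_ports.add(port)
        self.mcast_members.add(port)

    def heavy_sweep(self, responsive_ports, request_reregistration):
        """Rediscover the fabric and return the ports asked to re-register."""
        # Ports whose MADs time out are dropped from all tables (step 4).
        dropped = self.known_ports - responsive_ports
        self.known_ports -= dropped
        self.mcast_members -= dropped
        # Ports seen again are added back to the fabric (step 6).
        new = responsive_ports - self.known_ports
        self.known_ports |= new
        # Only with the flag set are those ports told to re-register.
        return set(new) if request_reregistration else set()


def simulate(request_reregistration):
    """Return True if node1 is still in the mcast group at the end."""
    sm = ToySM()
    sm.join("node1")
    # Steps 1-4: node1 is catatonic while a heavy sweep runs.
    sm.heavy_sweep(set(), request_reregistration)
    # Steps 5-6: node1 recovers and a later sweep finds it again.
    asked = sm.heavy_sweep({"node1"}, request_reregistration)
    # Step 7: the node saw no link event, so it rejoins only if asked.
    for port in asked:
        sm.join(port)
    return "node1" in sm.mcast_members


if __name__ == "__main__":
    print("without reregistration request:", simulate(False))
    print("with reregistration request:   ", simulate(True))
```

In this model the node falls out of the group permanently unless the sweep
that re-adds it also asks for reregistration, which is the behaviour the
port down/up comparison above is getting at.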
Thanks again for everyone's help on this,
Ira
