On Sun, 27 Apr 2008 17:11:40 +0000
Sasha Khapyorsky <[EMAIL PROTECTED]> wrote:

> Hi Ira,
> 
> On 13:38 Wed 23 Apr     , Ira Weiny wrote:
> > 
> > The symptom is that nodes drop out of the IPoIB mcast group after a node
> > temporarily goes catatonic.  The details are:
> > 
> >    1) Issues on a node cause a soft lockup of the node.
> >    2) OpenSM does a normal light sweep.
> >    3) MADs to the node time out since the node is in a "bad state"
> 
> Normally during light sweep OpenSM will not query nodes. I think OpenSM
> should not detect such soft lockup unless ib link state was changed and
> heavy sweep was triggered. Is this the case?

Yes, I agree.  As I said in my previous mail to Or, I found that light sweeps
did not in fact notice the nodes were gone.  Looking at the logs, I am not sure
what caused OpenSM to notice them.  However, something must have triggered a
heavy sweep while those nodes were catatonic.  From the logs, they were
unresponsive for multiple seconds, some for as long as 30s.  It is still a bit
of a mystery why OpenSM did a heavy sweep during this period, but I don't think
it is unreasonable for it to do so.

> 
> >    4) OpenSM marks the node down and drops it from internal tables,
> >       including mcast groups.
> >    5) Node recovers from soft lock up condition.
> >    6) A subsequent sweep causes OpenSM to see the node and add it back to
> >       the fabric.
> >    7) Node is fully functional on the verbs layer but IPoIB never knew
> >       anything was wrong so it does _not_ rejoin the mcast groups.  (This
> >       is different from the condition where the link actually goes down.)
> 
> If my approach above is correct it should be same as port down/up
> handling. And as was noted already in this thread OpenSM should ask
> for reregistration (by setting client reregistration bit).
> 
> I see your patch - it seems this part is buggy in OpenSM now, I will look
> at this more closely.
> 

Yes, I believe this is all fixed.
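
For anyone reading this thread later, steps 1-7 above can be sketched as a
small toy model.  This is only an illustration under simplifying assumptions:
ToySM, ToyNode, heavy_sweep, on_client_reregister and simulate are hypothetical
names invented for this sketch, not OpenSM or IPoIB code, and a single flag
stands in for the client reregistration bit.

```python
# Toy model only: all names here are hypothetical, not OpenSM or IPoIB APIs.

class ToySM:
    """Minimal stand-in for a subnet manager tracking one mcast group."""

    def __init__(self):
        self.fabric = set()       # nodes the SM currently believes are present
        self.mcast_group = set()  # members of the single modeled mcast group

    def join(self, node):
        # Models a node (re)joining the fabric and the mcast group.
        self.fabric.add(node.name)
        self.mcast_group.add(node.name)

    def heavy_sweep(self, nodes, request_rereg):
        for node in nodes:
            if not node.responsive:
                # Steps 3-4: queries time out, node is dropped everywhere.
                self.fabric.discard(node.name)
                self.mcast_group.discard(node.name)
            elif node.name not in self.fabric:
                # Step 6: node is seen again and added back to the fabric.
                self.fabric.add(node.name)
                if request_rereg:
                    # Assumed stand-in for setting the client
                    # reregistration bit on the rediscovered port.
                    node.on_client_reregister(self)


class ToyNode:
    def __init__(self, name):
        self.name = name
        self.responsive = True

    def on_client_reregister(self, sm):
        # The node only rejoins when it is explicitly asked to.
        sm.join(self)


def simulate(request_rereg):
    """Return (in_fabric, in_mcast_group) after the lockup and recovery."""
    sm = ToySM()
    node = ToyNode("n1")
    sm.join(node)
    node.responsive = False                # step 1: soft lockup
    sm.heavy_sweep([node], request_rereg)  # steps 3-4: node dropped
    node.responsive = True                 # step 5: node recovers
    sm.heavy_sweep([node], request_rereg)  # step 6: node rediscovered
    return node.name in sm.fabric, node.name in sm.mcast_group
```

In this model, simulate(False) leaves the node in the fabric but out of the
mcast group, which is the symptom in step 7, while simulate(True) puts it back
in both.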

Thanks again for everyone's help on this,
Ira
