Hi Ira,

On 13:38 Wed 23 Apr, Ira Weiny wrote:
> 
> The symptom is that nodes drop out of the IPoIB mcast group after a node
> temporarily goes catatonic.  The details are:
> 
>    1) Issues on a node cause a soft lockup of the node.
>    2) OpenSM does a normal light sweep.
>    3) MADs to the node time out since the node is in a "bad state"

Normally OpenSM does not query nodes during a light sweep. I think OpenSM
should not detect such a soft lockup unless the IB link state changed and
a heavy sweep was triggered. Is this the case?

>    4) OpenSM marks the node down and drops it from internal tables, including
>       mcast groups.
>    5) Node recovers from soft lock up condition.
>    6) A subsequent sweep causes OpenSM to see the node and add it back to the
>       fabric.
>    7) Node is fully functional on the verbs layer but IPoIB never knew anything
>       was wrong so it does _not_ rejoin the mcast groups.  (This is different
>       from the condition where the link actually goes down.)

If my assumption above is correct, this should be handled the same way as
port down/up. And as already noted in this thread, OpenSM should ask
for reregistration (by setting the client reregistration bit).
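To make the expected behaviour concrete, here is a minimal toy model of
steps 4-7. It is only a sketch: the class and function names are
hypothetical and are not OpenSM or kernel APIs. It just shows why group
membership is lost unless the SM asks the client to reregister.

```python
# Toy model only: all names here are hypothetical, not OpenSM or kernel APIs.

class SubnetManager:
    def __init__(self):
        # Names of nodes currently in the (single, simplified) mcast group.
        self.mcast_members = set()

    def join(self, node):
        self.mcast_members.add(node.name)

    def drop_node(self, node):
        # Step 4: the node is dropped from internal tables, including mcast groups.
        self.mcast_members.discard(node.name)

    def readd_node(self, node, request_reregistration):
        # Step 6: a later sweep rediscovers the node.
        # The flag stands in for setting the client reregistration bit.
        if request_reregistration:
            node.on_client_reregister(self)


class Node:
    def __init__(self, name):
        self.name = name

    def on_client_reregister(self, sm):
        # When asked to reregister, the client rejoins its mcast group.
        sm.join(self)


def simulate(request_reregistration):
    """Return True if the node is a group member after it recovers."""
    sm = SubnetManager()
    node = Node("node1")
    sm.join(node)        # initial join
    sm.drop_node(node)   # MADs timed out, node marked down
    sm.readd_node(node, request_reregistration)
    return node.name in sm.mcast_members
```

In this model simulate(False) leaves the node outside the group, which is
the reported symptom, while simulate(True) restores membership.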

I see your patch. It seems this part is currently buggy in OpenSM; I will
take a closer look at this.

Sasha