On Tue, 2007-05-22 at 20:01, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
>
> >The one I see that might be related is the following:
> >
> >commit 39798695b4bcc7b145f8910ca56195808d3a7637
> >Author: Roland Dreier <[EMAIL PROTECTED]>
> >Date:   Mon Nov 13 09:38:07 2006 -0800
> >
> >    IB/mad: Fix race between cancel and receive completion
> >
> >    When ib_cancel_mad() is called, it puts the canceled send on a list
> >    and schedules a "flushed" callback from process context. However,
> >    this leaves a window where a receive completion could be processed
> >    before the send is fully flushed.
> >
> >    This is fine, except that ib_find_send_mad() will find the MAD and
> >    return it to the receive processing, which results in the sender
> >    getting both a successful receive and a "flushed" send completion for
> >    the same request. Understandably, this confuses the sender, which is
> >    expecting only one of these two callbacks, and leads to grief such as
> >    a use-after-free in IPoIB.
> >
> >    Fix this by changing ib_find_send_mad() to return a send struct only
> >    if the status is still successful (and not "flushed"). The search of
> >    the send_list already had this check, so this patch just adds the same
> >    check to the search of the wait_list.
> >
> >    Signed-off-by: Roland Dreier <[EMAIL PROTECTED]>
> >
> >My search was not exhaustive.
> >
> >
> It looks like this may be the fix for the MAD send errors.

Perhaps.

> Do you
> think this is the cause of opensm not grabbing the mastership from the
> other?

Unlikely, but I don't know for sure.

> >Are they incrementing? Which node is this? I think some of them would
> >increment on node reboot.
> >
> >
> Looks like some counters (Symbol errors, link downed) have reached the
> top ceiling.

You should replace the cable and see if the symbol errors improve. You may
need to clear these with perfquery -R. I think Link downed will
increment when the node reboots.

> This output was captured on node vortex3l-83, the one that runs opensm.
> Do you want the perfquery output before and after some time interval?

I'm interested in VL15 drops, to make sure that is not going on.

-- Hal

> VBabu
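
As a rough illustration of the check described in the quoted commit message, here is a minimal Python sketch. It is not the kernel code (which is C), and it is only a model of the idea: the names SendRequest, find_send and cancel, the status strings, and matching on a transaction ID alone are all hypothetical simplifications. It shows a lookup that searches both the wait list and the send list and returns a request only if it has not been flushed.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical status values standing in for "successful" and "flushed".
SUCCESS = "success"
FLUSHED = "flushed"


@dataclass
class SendRequest:
    """Hypothetical stand-in for an outstanding MAD send request."""
    tid: int
    status: str = SUCCESS


def cancel(req: SendRequest) -> None:
    """Mark a request as canceled; the 'flushed' callback would run later."""
    req.status = FLUSHED


def find_send(wait_list: List[SendRequest],
              send_list: List[SendRequest],
              tid: int) -> Optional[SendRequest]:
    """Return the matching request only if it has not been canceled."""
    for req in wait_list:
        if req.tid == tid:
            # The check the commit message says was added for the wait list.
            return req if req.status == SUCCESS else None
    for req in send_list:
        if req.tid == tid:
            # The check the commit message says the send list already had.
            return req if req.status == SUCCESS else None
    return None


if __name__ == "__main__":
    waiting = [SendRequest(tid=1), SendRequest(tid=2)]
    cancel(waiting[0])
    # A response for the canceled request no longer matches, so the sender
    # would see only the "flushed" completion for it.
    print(find_send(waiting, [], 1))
    print(find_send(waiting, [], 2))
```

In this model, a receive that arrives after cancel() finds no matching request, so the sender gets one callback for the request instead of two.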
