On Fri, 22 Oct 2010, Bob Gilligan wrote:
> On 10/22/2010 3:41 PM, Brandeburg, Jesse wrote:
> > On Fri, 22 Oct 2010, Chris Friesen wrote:
> >> On 10/22/2010 11:06 AM, Chris Friesen wrote:
> >>> On 10/12/2010 11:08 AM, Chris Friesen wrote:
> >>>> On 10/08/2010 04:36 PM, Brandeburg, Jesse wrote:
> >>>>
> >>>>> seems reasonable, it should work okay.  Does it fix the problem?  It 
> >>>>> seems
> >>>>> there must be a race between when the interrupt gets re-enabled and when
> >>>>> the hardware clears the mask via EIAM on the next interrupt.
> >>>>>
> >>>> I'm about to give it a try.  The problem can take hours to reproduce, so
> >>>> we won't know for a day or so whether it's really gone.
> >>>>
> >>> It looks like the attached patch makes our problem go away.  I only did
> >>> the MSI-X/NAPI code path, so a complete solution would need some more
> >>> changes.
> >>>
> >>> Where do we go from here?  If this is something that occurs on other
> >>> boards would it make sense for the driver to provide a way to turn off
> >>> the automasking?  (Module parameter perhaps?)
> >
> > The question becomes: why haven't we been able to reproduce this, and why
> > haven't we seen it before?  I'm betting that there is something wrong with
> > the MSI-X semantics of either your kernel or the system hardware.

Oops, sorry Chris, I always think you're working on PPC. :-)  I still 
don't understand the complete lack of corroborating data.

> We saw an rx-side lockup similar to the one Chris reports a couple 
> months back, and reported it on this list:
> 
> http://sourceforge.net/mailarchive/message.php?msg_name=4C361F3F.7000604%40vyatta.com
> 
> Not sure if it is exactly the same problem, but the symptoms are quite 
> similar (rx side stops, tx side keeps working, link stays up).

I would be convinced if you could show me that the EIMS register has one 
or more of the queue bits cleared (disabled).
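
If it helps, here is a rough sketch of one way to peek at EIMS from user 
space without patching the driver.  It assumes an 82598/82599-class part 
(EIMS at offset 0x00880 in BAR0); the PCI address below is just a 
placeholder for whatever lspci reports for your port, and it has to run 
as root:

/*
 * Sketch only: dump the ixgbe EIMS register by mmap()ing BAR0 via sysfs.
 * Assumes EIMS lives at offset 0x00880 in BAR0 (82598/82599-class parts).
 * Don't poke read-to-clear registers like EICR this way; EIMS reads are
 * side-effect free.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define EIMS_OFFSET 0x00880   /* assumed IXGBE_EIMS offset */

int main(int argc, char **argv)
{
        /* placeholder PCI address -- pass your NIC's resource0 path instead */
        const char *res = argc > 1 ? argv[1]
                : "/sys/bus/pci/devices/0000:01:00.0/resource0";
        int fd = open(res, O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Map the first page of BAR0; EIMS is well inside it. */
        void *bar = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        uint32_t eims = *(volatile uint32_t *)((char *)bar + EIMS_OFFSET);
        printf("EIMS = 0x%08x\n", eims);

        munmap(bar, 4096);
        close(fd);
        return 0;
}

A 0 in one of the low-order queue bits means that vector is currently 
masked, which is what I'd expect to see if the automask is what's biting 
you.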

> We saw the problem in the field, but had a great deal of difficulty 
> reproducing it in the lab.  We finally found a combination of 
> load from an analyzer, a particular traffic mix and link flaps that 
> triggers the symptom reliably in 10 or 20 minutes.  Haven't yet had a 
> chance to try Chris' patch.  But I'll try it and report the results.

The problem induced via link flaps is similar to something else we are 
looking at, and I don't think it is related to Chris' problem (though I'm 
not 100% sure).
 
> (BTW, I'm testing on a Supermicro platform with 2 x Xeon x5570 CPUs, 
> 2.6.32 kernel, and version 2.1.4-NAPI of the ixgbe driver.)
> 
> You mentioned in an earlier message that these symptoms could be due to 
> overrunning the IRQ stack with all the interrupts from all the queues on 
> this NIC.  Is there a way to confirm that this is or isn't happening? 
> Some statistic perhaps?

If you rebuild your kernel with a newer GCC that supports the stack canary 
and set

CONFIG_CC_STACKPROTECTOR_ALL=y
CONFIG_CC_STACKPROTECTOR=y

then it will usually catch the stack corruption quickly if it is occurring.  
In my case, to reproduce that issue I just had to reload the driver several 
times in a row with 16 or more CPUs online, and the first time the watchdog 
triggered all 16 interrupts simultaneously the stack corruption would 
occur.  As I mentioned, the (pretty simple) patch to serialize the hard 
interrupt handlers fixes/works around it.
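
The general idea of that serialization is just to keep the per-vector hard 
handlers from nesting on the same CPU's IRQ stack.  Below is an 
illustrative sketch only (made-up names, not the exact patch), assuming a 
2.6.32-era request_irq() where handlers otherwise run with interrupts 
enabled:

#include <linux/interrupt.h>

/* Hypothetical per-queue handler, standing in for the driver's MSI-X
 * clean routine; it would normally just schedule NAPI for its q_vector. */
static irqreturn_t example_msix_clean(int irq, void *data)
{
        /* ... napi_schedule() for this vector's queues ... */
        return IRQ_HANDLED;
}

static int example_request_msix_vector(unsigned int vector, const char *name,
                                       void *q_vector)
{
        /*
         * IRQF_DISABLED runs the handler with local interrupts off, so a
         * burst of simultaneous vectors (e.g. the watchdog kicking every
         * queue at once via EICS) is taken one at a time instead of
         * nesting 16 deep on one IRQ stack.
         */
        return request_irq(vector, example_msix_clean, IRQF_DISABLED,
                           name, q_vector);
}

Later kernels run all hard handlers with interrupts disabled anyway, so 
this mostly matters on older trees like 2.6.32.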

Jesse
