On Fri, 22 Oct 2010, Bob Gilligan wrote:
> On 10/22/2010 3:41 PM, Brandeburg, Jesse wrote:
> > On Fri, 22 Oct 2010, Chris Friesen wrote:
> >> On 10/22/2010 11:06 AM, Chris Friesen wrote:
> >>> On 10/12/2010 11:08 AM, Chris Friesen wrote:
> >>>> On 10/08/2010 04:36 PM, Brandeburg, Jesse wrote:
> >>>>
> >>>>> seems reasonable, it should work okay.  Does it fix the problem?  It seems
> >>>>> there must be a race between when the interrupt gets re-enabled and when
> >>>>> the hardware clears the mask via EIAM on the next interrupt.
> >>>>
> >>>> I'm about to give it a try.  The problem can take hours to reproduce, so
> >>>> we won't know for a day or so whether it's really gone.
> >>>
> >>> It looks like the attached patch makes our problem go away.  I only did
> >>> the msix/NAPI code path, so a complete solution would need some more
> >>> changes.
> >>>
> >>> Where do we go from here?  If this is something that occurs on other
> >>> boards would it make sense for the driver to provide a way to turn off
> >>> the automasking?  (Module parameter perhaps?)
> >
> > The question becomes why haven't we been able to reproduce this and why
> > haven't we seen it before?  I'm betting that there is something wrong with
> > the MSI-X semantics of either your kernel or the system hardware.
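
To make the discussion concrete: the change amounts to re-arming the queue's
EIMS bit explicitly from the NAPI poll routine instead of trusting the EIAM
automask to have left it set.  A rough sketch of that pattern follows; this is
not Chris' attached patch, and the my_*/MY_* names are hypothetical
placeholders rather than real ixgbe symbols:

    /* Sketch only: explicitly re-arm this queue's MSI-X vector when the
     * poll finishes, rather than assuming the EIAM automask left it enabled.
     */
    static int my_napi_poll(struct napi_struct *napi, int budget)
    {
            struct my_q_vector *q =                 /* hypothetical type */
                    container_of(napi, struct my_q_vector, napi);
            int work_done = my_clean_rx(q, budget); /* hypothetical rx cleanup */

            if (work_done < budget) {
                    napi_complete(napi);
                    /* Set this queue's bit in EIMS unconditionally so a lost
                     * automask race cannot leave the vector masked forever.
                     * MY_EIMS_OFFSET stands in for the real register offset
                     * from the datasheet. */
                    writel(1U << q->v_idx, q->hw_addr + MY_EIMS_OFFSET);
            }
            return work_done;
    }
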
Oops, sorry Chris, I always think you're working on PPC. :-)  I still
don't understand the complete lack of corroborating data.

> We saw an rx-side lockup similar to the one Chris reports a couple
> months back, and reported it on this list:
>
> http://sourceforge.net/mailarchive/message.php?msg_name=4C361F3F.7000604%40vyatta.com
>
> Not sure if it is exactly the same problem, but the symptoms are
> quite similar (rx side stops, tx side keeps working, link stays up).

I would be convinced if you could show me that the EIMS register shows
one or more of the queue bits disabled.

> We saw the problem in the field, but had a great deal of difficulty
> reproducing this problem in the lab.  We finally found a combination of
> load from an analyzer, a particular traffic mix and link flaps that
> triggers the symptom reliably in 10 or 20 minutes.  Haven't yet had a
> chance to try Chris' patch.  But I'll try it and report the results.

The problem induced via link flaps is similar to something else we are
looking at, and I don't think it is related to Chris' problem (not 100%
sure).

> (BTW, I'm testing on a Supermicro platform with 2 x Xeon x5570 CPUs,
> 2.6.32 kernel, and version 2.1.4-NAPI of the ixgbe driver.)
>
> You mentioned in an earlier message that these symptoms could be due to
> overrunning the IRQ stack with all the interrupts from all the queues on
> this NIC.  Is there a way to confirm that this is or isn't happening?
> Some statistic perhaps?

If you rebuild your kernel with a newer GCC that supports the stack
canary and set

  CONFIG_CC_STACKPROTECTOR_ALL=y
  CONFIG_CC_STACKPROTECTOR=y

then it will usually catch the stack corruption quickly if it is
occurring.  In my case, to reproduce that issue I just had to reload the
driver several times in a row with 16 or more CPUs online, and the first
time the watchdog triggered all 16 interrupts simultaneously the stack
corruption would occur.  As I mentioned, the (pretty simple) patch to
serialize the hard interrupt handlers fixes/works around it.

Jesse
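
P.S. One way to convince yourself about the EIMS bits is to peek at the
register directly from user space by mmap()ing the NIC's BAR0 through
sysfs.  A rough sketch follows; note that the EIMS offset is not
hard-coded below, take it from the 82598/82599 datasheet for your part,
and be careful in general when poking at live hardware registers:

    /*
     * peekreg.c: dump one 32-bit register from a PCI device's BAR0.
     * Build: gcc -o peekreg peekreg.c
     * Run as root, e.g. (device address and offset are yours to supply):
     *   ./peekreg /sys/bus/pci/devices/0000:01:00.0/resource0 <EIMS offset>
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <resource0 path> <hex offset>\n",
                            argv[0]);
                    return 1;
            }

            unsigned long off = strtoul(argv[2], NULL, 16);
            int fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* map the page that contains the register */
            long pagesz = sysconf(_SC_PAGESIZE);
            unsigned long base = off & ~((unsigned long)pagesz - 1);
            void *map = mmap(NULL, pagesz, PROT_READ, MAP_SHARED, fd, base);
            if (map == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            uint32_t val = *(volatile uint32_t *)((char *)map + (off - base));
            printf("reg 0x%lx = 0x%08x\n", off, val);

            munmap(map, pagesz);
            close(fd);
            return 0;
    }

If the bit for one of the rx queues reads back as 0 while traffic on that
queue is stalled, that would be the corroborating data I'm after.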