On 10/22/2010 3:41 PM, Brandeburg, Jesse wrote:
>
>
> On Fri, 22 Oct 2010, Chris Friesen wrote:
>
>> On 10/22/2010 11:06 AM, Chris Friesen wrote:
>>> On 10/12/2010 11:08 AM, Chris Friesen wrote:
>>>
>>>> On 10/08/2010 04:36 PM, Brandeburg, Jesse wrote:
>>>>
>>>>> seems reasonable, it should work okay. Does it fix the problem? It seems
>>>>> there must be a race between when the interrupt gets re-enabled and when
>>>>> the hardware clears the mask via EIAM on the next interrupt.
>>>>>
>>>> I'm about to give it a try. The problem can take hours to reproduce, so
>>>> we won't know for a day or so whether it's really gone.
>>>>
>>> It looks like the attached patch makes our problem go away. I only did
>>> the msix/NAPI code path, so a complete solution would need some more
>>> changes.
>>>
>>> Where do we go from here? If this is something that occurs on other
>>> boards would it make sense for the driver to provide a way to turn off
>>> the automasking? (Module parameter perhaps?)
>
> The question becomes why haven't we been able to reproduce this and why
> haven't we seen it before? I'm betting that there is something wrong with
> the MSI-X semantics of either your kernel or the system hardware.

A couple of months back we saw an rx-side lockup similar to the one Chris
reports, and reported it on this list:

http://sourceforge.net/mailarchive/message.php?msg_name=4C361F3F.7000604%40vyatta.com

I'm not sure it is exactly the same problem, but the symptoms are quite
similar (rx side stops, tx side keeps working, link stays up).

We saw the problem in the field, but had a great deal of difficulty
reproducing it in the lab. We finally found a combination of load from an
analyzer, a particular traffic mix, and link flaps that triggers the symptom
reliably in 10 or 20 minutes.

I haven't yet had a chance to try Chris' patch, but I'll try it and report
the results. (BTW, I'm testing on a Supermicro platform with 2 x Xeon X5570
CPUs, a 2.6.32 kernel, and version 2.1.4-NAPI of the ixgbe driver.)

You mentioned in an earlier message that these symptoms could be due to
overrunning the IRQ stack with all the interrupts from all the queues on
this NIC. Is there a way to confirm that this is or isn't happening? Some
statistic perhaps?

Bob.

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired