On 10/22/2010 4:46 PM, Brandeburg, Jesse wrote:
>
>
> On Fri, 22 Oct 2010, Bob Gilligan wrote:
>> On 10/22/2010 3:41 PM, Brandeburg, Jesse wrote:
>>> On Fri, 22 Oct 2010, Chris Friesen wrote:
>>>> On 10/22/2010 11:06 AM, Chris Friesen wrote:
>>>>> On 10/12/2010 11:08 AM, Chris Friesen wrote:
>>>>>> On 10/08/2010 04:36 PM, Brandeburg, Jesse wrote:
>>>>>>
>>>>>>> seems reasonable, it should work okay. Does it fix the problem? It
>>>>>>> seems there must be a race between when the interrupt gets re-enabled
>>>>>>> and when the hardware clears the mask via EIAM on the next interrupt.
>>>>>>>
>>>>>> I'm about to give it a try. The problem can take hours to reproduce, so
>>>>>> we won't know for a day or so whether it's really gone.
>>>>>>
>>>>> It looks like the attached patch makes our problem go away. I only did
>>>>> the msix/NAPI code path, so a complete solution would need some more
>>>>> changes.
>>>>>
>>>>> Where do we go from here? If this is something that occurs on other
>>>>> boards would it make sense for the driver to provide a way to turn off
>>>>> the automasking? (Module parameter perhaps?)
>>>
>>> The question becomes why haven't we been able to reproduce this and why
>>> haven't we seen it before? I'm betting that there is something wrong with
>>> the MSI-X semantics of either your kernel or the system hardware.
>
> Oops, sorry Chris, I always think you're working on PPC. :-) I still
> don't understand the complete lack of corroborating data.
>
>> We saw an rx-side lockup similar to the one Chris reports a couple
>> months back, and reported it on this list:
>>
>> http://sourceforge.net/mailarchive/message.php?msg_name=4C361F3F.7000604%40vyatta.com
>>
>> Not sure if it is exactly the same problem, but the symptoms are
>> quite similar (rx side stops, tx side keeps working, link stays up).
>
> I would be convinced if you could show me that the EIMS register shows 1
> or more of the queue bits disabled.

We usually see all 16 low-order bits set in EIMS when the problem occurs:

0x00880: EIMS (Extended Interr. Mask Set/Read)    0xD61BFFFF

>
>> We saw the problem in the field, but had a great deal of difficulty
>> reproducing this problem in the lab. We finally found a combination of
>> load from an analyzer, a particular traffic mix and link flaps that
>> triggers the symptom reliably in 10 or 20 minutes. Haven't yet had a
>> chance to try Chris' patch. But I'll try it and report the results.
>
> the problem induced via link flaps is similar to something else we are
> looking at, and I don't think it is related to Chris' problem (not 100%
> sure).

I finally had a chance to try out Chris' patch, and it did NOT fix the
problem. I still can get receive-side lockups by flapping the link. So
it looks like the problems are indeed different.

>
>> (BTW, I'm testing on a Supermicro platform with 2 x Xeon x5570 CPUs,
>> 2.6.32 kernel, and version 2.1.4-NAPI of the ixgbe driver.)
>>
>> You mentioned in an earlier message that these symptoms could be due to
>> overrunning the IRQ stack with all the interrupts from all the queues on
>> this NIC. Is there a way to confirm that this is or isn't happening?
>> Some statistic perhaps?
>
> if you rebuild your kernel with a newer GCC that supports stack canary and
> CONFIG_CC_STACKPROTECTOR_ALL=y
> CONFIG_CC_STACKPROTECTOR=y
>
> then usually it will catch the stack corruption quickly if it is occurring.
> In my case to repro that issue I just had to reload the driver several
> times in a row with 16 or more cpus online, and the first time the
> watchdog triggered all 16 interrupts simultaneously the stack corruption
> would occur. As I mentioned the (pretty simple) patch to serialize hard
> interrupt handlers fixes/works around it.

Thanks for the info. I re-built our kernel with stackprotector enabled
and did not get any panics after reproducing the lockup several times.
So I guess that isn't it either.

Bob.
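
For anyone who wants to check a register dump for the condition Jesse
asks about (one or more queue bits cleared in EIMS), here is a minimal
sketch. It assumes, based only on this thread, that the 16 low-order bits
of EIMS are the per-queue bits, and that the dump line has the same
layout as the one quoted above. The helper names parse_register and
masked_queues are made up for illustration and are not part of any tool.

```python
import re

# Assumption taken from this thread: the 16 low-order bits of EIMS are
# the per-queue interrupt bits. Adjust if your hardware differs.
QUEUE_BITS = 16


def parse_register(line):
    """Return the trailing hex value from a register dump line such as
    '0x00880: EIMS (Extended Interr. Mask Set/Read) 0xD61BFFFF'."""
    match = re.search(r"0x([0-9A-Fa-f]+)\s*$", line)
    if match is None:
        raise ValueError("no trailing hex value in: %r" % line)
    return int(match.group(1), 16)


def masked_queues(eims, nbits=QUEUE_BITS):
    """List the low-order bit positions that are clear in the value."""
    return [bit for bit in range(nbits) if not (eims >> bit) & 1]


if __name__ == "__main__":
    line = "0x00880: EIMS (Extended Interr. Mask Set/Read) 0xD61BFFFF"
    eims = parse_register(line)
    print("EIMS = 0x%08X, cleared queue bits: %s" % (eims, masked_queues(eims)))
```

For the value quoted above the list of cleared queue bits comes back
empty, which matches the observation that all 16 low-order bits are set.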
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel