On 10/22/2010 4:46 PM, Brandeburg, Jesse wrote:
>
>
> On Fri, 22 Oct 2010, Bob Gilligan wrote:
>> On 10/22/2010 3:41 PM, Brandeburg, Jesse wrote:
>>> On Fri, 22 Oct 2010, Chris Friesen wrote:
>>>> On 10/22/2010 11:06 AM, Chris Friesen wrote:
>>>>> On 10/12/2010 11:08 AM, Chris Friesen wrote:
>>>>>> On 10/08/2010 04:36 PM, Brandeburg, Jesse wrote:
>>>>>>
>>>>>>> seems reasonable, it should work okay.  Does it fix the problem?  It seems
>>>>>>> there must be a race between when the interrupt gets re-enabled and when
>>>>>>> the hardware clears the mask via EIAM on the next interrupt.
>>>>>>>
>>>>>> I'm about to give it a try.  The problem can take hours to reproduce, so
>>>>>> we won't know for a day or so whether it's really gone.
>>>>>>
>>>>> It looks like the attached patch makes our problem go away.  I only did
>>>>> the msix/NAPI code path, so a complete solution would need some more
>>>>> changes.
>>>>>
>>>>> Where do we go from here?  If this is something that occurs on other
>>>>> boards would it make sense for the driver to provide a way to turn off
>>>>> the automasking?  (Module parameter perhaps?)
>>>
>>> The question becomes why haven't we been able to reproduce this and why
>>> haven't we seen it before?  I'm betting that there is something wrong with
>>> the MSI-X semantics of either your kernel or the system hardware.
>
> Oops, sorry Chris, I always think you're working on PPC. :-)  I still
> don't understand the complete lack of corroborating data.
>
>> We saw an rx-side lockup similar to the one Chris reports a couple
>> months back, and reported it on this list:
>>
>> http://sourceforge.net/mailarchive/message.php?msg_name=4C361F3F.7000604%40vyatta.com
>>
>> Not sure if it is exactly the same problem, but the symptoms are
>> quite similar (rx side stops, tx side keeps working, link stays up).
>
> I would be convinced if you could show me that the EIMS register shows 1
> or more of the queue bits disabled.

We usually see all 16 low-order bits set in EIMS when the problem occurs:

0x00880: EIMS        (Extended Interr. Mask Set/Read) 0xD61BFFFF
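
For anyone who wants to check a dump like this mechanically, here is a minimal sketch that lists which of the low-order bits are clear. It assumes the low 16 bits are the per-queue interrupt bits, as discussed above; the helper name masked_queue_bits is made up for illustration and is not part of the driver or ethtool.

```python
EIMS_VALUE = 0xD61BFFFF  # value taken from the register dump above
QUEUE_BITS = 16          # assumption: low-order 16 bits are the queue vectors


def masked_queue_bits(eims, nbits=QUEUE_BITS):
    """Return the bit positions in the low nbits that are clear (disabled)."""
    return [bit for bit in range(nbits) if not (eims >> bit) & 1]


# Isolate the low-order queue bits and list any that are not set.
low = EIMS_VALUE & ((1 << QUEUE_BITS) - 1)
print(hex(low))                       # 0xffff
print(masked_queue_bits(EIMS_VALUE))  # [] -> no queue bit is disabled
```

An empty list means every queue bit is still enabled in EIMS, which is what we see when the lockup occurs.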

>
>> We saw the problem in the field, but had a great deal of difficulty
>> reproducing this problem in the lab.  We finally found a combination of
>> load from an analyzer, a particular traffic mix and link flaps that
>> triggers the symptom reliably in 10 or 20 minutes.  Haven't yet had a
>> chance to try Chris' patch.  But I'll try it and report the results.
>
> the problem induced via link flaps is similar to something else we are
> looking at, and I don't think it is related to Chris' problem (not 100%
> sure).

I finally had a chance to try out Chris' patch, and it did NOT fix the 
problem.  I still can get receive-side lockups by flapping the link.  So 
it looks like the problems are indeed different.

>
>> (BTW, I'm testing on a Supermicro platform with 2 x Xeon x5570 CPUs,
>> 2.6.32 kernel, and version 2.1.4-NAPI of the ixgbe driver.)
>>
>> You mentioned in an earlier message that these symptoms could be due to
>> overrunning the IRQ stack with all the interrupts from all the queues on
>> this NIC.  Is there a way to confirm that this is or isn't happening?
>> Some statistic perhaps?
>
> if you rebuild your kernel with a newer GCC that supports stack canary and
> CONFIG_CC_STACKPROTECTOR_ALL=y
> CONFIG_CC_STACKPROTECTOR=y
>
> then usually it will catch the stack corruption quickly if it is occurring.
> In my case to repro that issue I just had to reload the driver several
> times in a row with 16 or more cpus online, and the first time the
> watchdog triggered all 16 interrupts simultaneously the stack corruption
> would occur.  As I mentioned the (pretty simple) patch to serialize hard
> interrupt handlers fixes/works around it.

Thanks for the info.  I re-built our kernel with stackprotector enabled 
and did not get any panics after reproducing the lockup several times. 
So I guess that isn't it either.
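
In case it helps anyone repeating the test, this is a rough sketch of how one could confirm that a rebuilt kernel's config really has both options set. It only parses config text; where that text comes from (for example a /boot/config-<version> file) is an assumption about the system, and the function name enabled_options is hypothetical.

```python
def enabled_options(config_text, names):
    """Return the subset of names that appear as NAME=y in kernel config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key in names and value == "y":
            enabled.add(key)
    return enabled


WANTED = {"CONFIG_CC_STACKPROTECTOR", "CONFIG_CC_STACKPROTECTOR_ALL"}

# Sample text standing in for the real config file contents.
sample = """\
CONFIG_CC_STACKPROTECTOR_ALL=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_FOO is not set
"""
print(sorted(enabled_options(sample, WANTED)))
```

If either name is missing from the result, the stack canary checks were not built in and the absence of panics proves nothing.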

Bob.



_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired