For the record:

The mentioned patch did not make any difference. Even with this patch, the 
82599EB 10G card stopped accepting any data after 3 days of constant (but not 
too high) load.

So we moved to the newest driver release, 3.4.24-NAPI, and the system ran for 
6 days without an issue.
I really don't like this situation, having no idea what actually goes wrong. 
But we asked our customer to give the 2.6.32 kernel with this driver a try, 
because he sees the issue far more often than we do.

I intentionally crashed the kernel to get a core dump in the error situation 
but could not see anything obvious in the driver structures.

If the issue does not happen again on the customer side, we'll switch to the 
new version and cross our fingers that it does not break anything else.

Cheers,
Martin

-----Original Message-----
From: Skidmore, Donald C [mailto:[email protected]]
Sent: Thursday, July 28, 2011 19:56
To: Zielinski, Martin; Kirsher, Jeffrey T
Cc: [email protected]
Subject: RE: [E1000-devel] ixgbe: not accepting any packets - increasing 
rx_missed_errors

Hey Martin,

Sorry about your troubles with ixgbe.  I think there is a fair chance you're 
seeing the issue you referenced in the other email.  The patch I'm referring to 
is a7f5a5fcd9f13afd3471a0de8c1fdaa8f989497c.

When we saw this issue it was primarily on short DA cables (1m - 3m) and had to 
do with a FW/SW semaphore collision at link time.  We were later able to 
recreate the failure on longer DA cables and fiber but it happened at a much 
slower rate.

I'll try to address your questions below:

>> I am aware that this is an old driver version, but please give me a
>chance to explain why I'm asking for information anyway:
>>
>> - The driver is part of the 2.6.32 stable branch.

We try to backport to stable branches for security issues or very critical 
failures.  But other than that we don't actively push patches to these 
branches.  This is possible because our current out-of-tree driver (on 
SourceForge) works with a wide range of kernels, back to the 2.4.x time frame.  
In addition, various distros backport our upstream patches as they apply to 
their older kernels.  This seems to cover most people's needs, although it 
sounds like it might not work for you.

>> - It takes 2 - 10 days to reproduce it in the lab. So if we use a
>newer version, we cannot be sure that the problem is fixed just because
>we don't see it anymore.

The problem I mentioned above was like that.  It was very difficult to 
recreate, even with the special test scripts we narrowed down for it (tight 
loops bringing the link up and down, verifying each step).  At times it took as 
long as 53 hours.  After this fix we ran several machines for over a week with 
no failures.  Likewise, other people in the community have hit this failure, 
and upgrading to a new driver seemed to fix their problem.  I might be able to 
dig up the Perl script I wrote to speed up the failure if you're interested.
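
For readers without access to that Perl script, the kind of loop described 
above can be sketched roughly as follows.  This is not the Intel test script; 
it is a minimal Python sketch that assumes the iproute2 `ip` tool and the sysfs 
`carrier` file are available, and the names `flap_link`, `link_setter` and 
`link_ok` are made up for illustration.  It only checks that carrier comes back 
after each cycle; also checking RX counters would be closer to the failure 
discussed in this thread.

```python
import subprocess
import time


def set_link(iface, state):
    """Bring the interface administratively 'up' or 'down' via iproute2."""
    subprocess.run(["ip", "link", "set", "dev", iface, state], check=True)


def carrier_up(iface):
    """Return True if the kernel reports carrier on the interface.

    Reading the sysfs file fails while the interface is administratively
    down, so any OSError is treated as "no carrier".
    """
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        return False


def flap_link(iface, cycles, timeout=10.0, link_setter=set_link,
              link_ok=carrier_up, sleep=time.sleep, clock=time.monotonic):
    """Toggle the link `cycles` times.

    Returns the zero-based index of the first cycle in which the link did
    not come back within `timeout` seconds, or None if every cycle passed.
    """
    for i in range(cycles):
        link_setter(iface, "down")
        link_setter(iface, "up")
        deadline = clock() + timeout
        while not link_ok(iface):
            if clock() >= deadline:
                return i
            sleep(0.1)
    return None
```

Calling `flap_link("eth4", 100000)` as root would run the loop against the 
interface from this report; the setter and checker are parameters so that the 
verification step can be replaced with something stricter.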

>> - According to the customer the issue started with an update that adds
>the memory boundary and disables packet split (errata #45). PSRTYPE
>register is not initialized in this version. Everything in the previous
>version worked (so with the even older driver).

This shouldn't be related to the patch I referenced above, but we haven't 
tested specifically for it, as the fix was never backported to this branch.  
But the fix you're talking about (for erratum #45) went into all the stable 
branches as well as net-next.  We are continually validating on net-next and 
haven't seen this failure there.

>> - It is a critical customer. If we provide a new version and it fails
>again this will become a problem.

I can understand that.  But we only test longterm kernel drivers as we add 
patches to them, which as I mentioned above isn't all that often.  Our focus is 
more on the out-of-tree driver, making sure it plays well with older kernels.  
It just comes down to using our limited resources where we get the most gain.

>> - All reports about this issue end up without resolution or the advice
>to update the driver. I really tried to extract an explanation or the
>exact changeset that fixes the issue. But I failed. So for documentation
>purposes it would be a good thing to make the solution googleable.

We advised updating the driver, as that is where the fix was put in.  If you 
want to see a list of the patches, git would be a good place to start.  
Everything is there, although it might be a bit overwhelming, as there are 
probably around 1000 patches.

Hope this helps,
-Don Skidmore <[email protected]>


>-----Original Message-----
>From: [email protected] [mailto:[email protected]]
>Sent: Thursday, July 28, 2011 6:20 AM
>To: Kirsher, Jeffrey T; Skidmore, Donald C
>Cc: [email protected]
>Subject: RE: [E1000-devel] ixgbe: not accepting any packets - increasing
>rx_missed_errors
>
>Thanks Jeff,
>
>I don't think I was in contact with Don before.
>Shall I send the requested information directly? I don't think a large
>attachment will go to the list.
>
>So as not to just spam the list, here is some information about the system:
>
>Kernel: 2.6.32.36
>
>ethtool -i eth4
>driver: ixgbe
>version: 2.0.44-k2
>firmware-version: 1.8-0
>bus-info: 0000:0a:00.0
>
>lspci -vvv is very large but at the error state no traffic is accepted.
>So the PCIe speed as mentioned in the datasheet is no limiting factor
>here.
>
>0a:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
>        Subsystem: Unknown device 1b6d:00a0
>        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
>        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
>        Latency: 0, Cache Line Size: 64 bytes
>        Interrupt: pin A routed to IRQ 40
>        Region 0: Memory at df5c0000 (64-bit, non-prefetchable) [size=128K]
>        Region 2: I/O ports at ecc0 [size=32]
>        Region 4: Memory at df5b8000 (64-bit, non-prefetchable) [size=16K]
>        Capabilities: [40] Power Management version 3
>                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
>                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
>        Capabilities: [50] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
>                Address: 0000000000000000  Data: 0000
>        Capabilities: [70] MSI-X: Enable+ Mask- TabSize=64
>                Vector table: BAR=4 offset=00000000
>                PBA: BAR=4 offset=00002000
>        Capabilities: [a0] Express Endpoint IRQ 0
>                Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
>                Device: Latency L0s <512ns, L1 <64us
>                Device: AtnBtn- AtnInd- PwrInd-
>                Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>                Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                Device: MaxPayload 256 bytes, MaxReadReq 512 bytes
>                Link: Supported Speed unknown, Width x8, ASPM L0s, Port 4
>                Link: Latency L0s unlimited, L1 <32us
>                Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
>                Link: Speed unknown, Width x8
>        Capabilities: [e0] Vital Product Data
>        Capabilities: [100] Advanced Error Reporting
>        Capabilities: [140] Device Serial Number 00-00-00-ff-ff-00-00-00
>        Capabilities: [150] Unknown (14)
>        Capabilities: [160] Unknown (16)
>
>
>ethtool -d output, taken twice a few seconds apart (only registers that
>changed, TX registers omitted):
>
>Offset   Name    Description                    1st read   2nd read
>0x00048: FRTIMER (Free Running Timer)           0x55AE1845 0x5613A01A
>0x03FA0: mpc0    (Missed Packets Count 0)       0x0000511C 0x00005124
>0x0405C: prc64   (Packets Received (64B) Count) 0x000172DF 0x000172E7
>0x04078: bprc    (Broadcast Packets Rx Count)   0x0000239E 0x000023A3
>0x0407C: mprc    (Multicast Packets Rx Count)   0x00015BAE 0x00015BB1
>0x04088: gorcl   (Good Octets Rx Count Low)     0x6882B90F 0x6882BB0F
>0x040C0: torl    (Total Octets Rx Count Low)    0x688B88FF 0x688B8AFF
>0x040D0: tpr     (Total Packets Received)       0x0D6326BB 0x0D6326C3
>
>Neither the Receive Descriptor Head nor Tail register changes.
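
A comparison like the one above can be automated.  The following is a minimal 
Python sketch, assuming each raw `ethtool -d` line has the form 
"0xOFFSET: name (description) 0xVALUE" as in the excerpt quoted here; the 
function names `parse_dump` and `changed_registers` are made up for this 
illustration and are not part of ethtool or the driver.

```python
import re

# Matches raw dump lines such as:
#   0x03FA0: mpc0        (Missed Packets Count 0)         0x0000511C
REG_LINE = re.compile(
    r"^(0x[0-9A-Fa-f]+):\s+(\S+)\s+\((.*)\)\s+(0x[0-9A-Fa-f]+)\s*$"
)


def parse_dump(text):
    """Return {register_name: integer_value} for every matching line."""
    regs = {}
    for line in text.splitlines():
        m = REG_LINE.match(line.strip())
        if m:
            regs[m.group(2)] = int(m.group(4), 16)
    return regs


def changed_registers(before, after):
    """Return {name: (old, new)} for registers whose value differs."""
    old, new = parse_dump(before), parse_dump(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


if __name__ == "__main__":
    import sys

    # Usage: script.py dump1.txt dump2.txt (two saved `ethtool -d` outputs)
    with open(sys.argv[1]) as a, open(sys.argv[2]) as b:
        diff = changed_registers(a.read(), b.read())
    for name, (o, n) in diff.items():
        print(f"{name}: 0x{o:08X} -> 0x{n:08X}")
```

In the stalled state described in this report, one would expect mpc0 to show 
up in the result while the receive descriptor head and tail registers do not.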
>
>dmesg: Nothing
>
>Cheers,
>Martin
>
>-----Original Message-----
>From: [email protected] [mailto:[email protected]] On Behalf Of Jeff
>Kirsher
>Sent: Thursday, July 28, 2011 13:17
>To: Zielinski, Martin; Don Skidmore
>Cc: [email protected]
>Subject: Re: [E1000-devel] ixgbe: not accepting any packets - increasing
>rx_missed_errors
>
>On Thu, Jul 28, 2011 at 01:41,  <[email protected]> wrote:
>> Hello,
>>
>> With an 82599EB card a customer often runs into the situation that no
>packet can be received anymore until network restart.
>> The symptom is that the rx_missed_errors register counts each packet
>but no more packets can be seen by the kernel.
>>
>> We are using a 2.6.32 kernel with version: 2.0.44-k2.
>>
>> I am aware that this is an old driver version, but please give me a
>chance to explain why I'm asking for information anyway:
>>
>> - The driver is part of the 2.6.32 stable branch.
>> - It takes 2 - 10 days to reproduce it in the lab. So if we use a
>newer version, we cannot be sure that the problem is fixed just because
>we don't see it anymore.
>> - According to the customer the issue started with an update that adds
>the memory boundary and disables packet split (errata #45). PSRTYPE
>register is not initialized in this version. Everything in the previous
>version worked (so with the even older driver).
>> - It is a critical customer. If we provide a new version and it fails
>again this will become a problem.
>> - All reports about this issue end up without resolution or the advice
>to update the driver. I really tried to extract an explanation or the
>exact changeset that fixes the issue. But I failed. So for documentation
>purposes it would be a good thing to make the solution googleable.
>>
>> Don Skidmore wrote in:
>>
>>
>http://sourceforge.net/mailarchive/forum.php?thread_name=29F4ED941D916B4
>8B88B4D2A4F3D1B9C01D2E285AF%40orsmsx509.amr.corp.intel.com&forum_name=e1
>000-devel
>>
>> "Have you tried using the latest Source Forge driver (3.2.9).
>Including in it was a fix that corrected an erratum that sounds very
>similar to your issue."
>>
>> I'd greatly appreciate it if someone could point me in the right direction.
>What I'd like to understand is:
>>
>
>Don seems to have been working with you, so I will let him continue
>assisting you (since he is the ixgbe maintainer).
>There have been 15 more recent out-of-tree driver releases since the
>one you are using, so it is very possible that the issue you are seeing
>was fixed later on in one of the more recent driver releases, and the
>fix was not backported to the older 2.6.32 kernel.  If Don does not
>have the information already, please send any information that you can
>provide (e.g. kernel config, lspci -vvv output, dmesg log with the
>errors you are seeing).  This information can help us greatly in
>determining which fixes implemented in later versions of the driver
>would have an effect on the issue you are seeing.  Once we narrow down
>the fix(es) that resolve the issue, we can provide additional
>information on what the exact change is and why.
>
>With some (not all) fixes, we should have testing scenarios which
>would consistently reproduce the issue, so that we can accurately
>determine if the fix(es) resolved the issue.  I know that I am
>speaking in generalities and nothing specific; this is mainly because
>I do not know the exact issue you are having or the possible fixes that
>Don is aware of.
>
>I have added Don to this email thread, and will let him work with you
>to get the specifics on the issue(s) you are seeing, so that we can
>work on getting a resolution to you, whether it be an updated driver
>or a patchset against your kernel.
>
>Cheers,
>Jeff
>
>> - What change exactly is the fix for this issue?
>> - How can I verify that I am seeing the same issue (some special
>register/memory dump/...)?
>> - How can I verify that the issue is fixed.
>>
>> I know - I'm asking for support for a driver that is part of the
>stable kernel but very old in your development line.
>> So I would be even happier if someone takes the time to answer my
>questions.
>>
>> Cheers,
>> Martin
>>
>> Martin Zielinski
>> Dipl. Inform
>> Senior Engineer
>>
>> McAfee GmbH
>>
>> Firmensitz:     Muenchen
>> Amtsgericht:     AG Muenchen
>> Handelsregister:   HRB 144340
>> Geschaeftsfuehrer: Emmet Russell, Keith Krzeminski, Douglas Rice
>> Bankverbindung:   ABN-Amro Bank N.V. Konto 671 211 9006
>> UST-ID:   DE168122444
>>
>> _______________________________________________
>> E1000-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/e1000-devel
>> To learn more about Intel® Ethernet, visit
>http://communities.intel.com/community/wired
>>
>
>
>
>--
>Cheers,
>Jeff
>