On Fri, 14 Oct 2011 10:04:26 -0700
Flavio Leitner <f...@redhat.com> wrote:

> 
> Hi,
> 
> I got few reports so far that 82571EB models are having the
> "Detected Hardware Unit Hang" issue after upgrading the kernel.
> 
> Further debugging with an instrumented kernel revealed that the
> socket buffer time stamp matches with the last time e1000_xmit_frame()
> was called. Also that the time stamp of e1000_clean_tx_irq() last run
> is prior to the one in socket buffer.
> 
> However, ~1 second later, an interrupt is fired and the old entry
> is found. Sometimes, the scheduled print_hang_task dumps the
> information _after_ the old entry is sent (shows empty ring),
> indicating that the HW TX unit isn't really stuck and apparently
> just missed the signal to initiate the transmission.
> 
> Order of events:
>  (1) skb is pushed down
>  (2) e1000_xmit_frame() is called
>  (3) ring is filled with one entry
>  (4) TDT is updated
> >(5) nothing happens for little more than 1 second
>  (6) interrupt is fired
>  (7) e1000_clean_tx_irq() is called
>  (8) finds the entry not ready with an old time stamp,
>      schedules print_hang_task and stops the TX queue.
>  (9) print_hang_task runs, dump the info but the old entry is now sent
> (10) apparently the TX queue is back.

Flavio, thanks for the detailed info, please be sure to supply us the
bugzilla number.

TDH is probably not moving due to the writeback threshold settings in
TXDCTL.  netperf UDP_RR test is likely a good way to test this.

I don't think the sequence is quite what you said.  We are going to
work with the hardware team to get a sequence that works right, and we
should have a fix for you soon.

> 
> The following commit seems to be related to the symptoms seen above:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3a3b75860527a11ba5035c6aa576079245d09e2a
> 
>  From: Jesse Brandeburg <jesse.brandeb...@intel.com>
>  Date: Wed, 29 Sep 2010 21:38:49 +0000 (+0000)
>  Subject: e1000e: use hardware writeback batching
>  X-Git-Tag: v2.6.37-rc1~147^2~299
>  X-Git-Url:
> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=3a3b75860527a11ba5035c6aa576079245d09e2a
>  
> 
>  e1000e: use hardware writeback batching
> 
>  Most e1000e parts support batching writebacks.  The problem with this is
>  that when some of the TADV or TIDV timers are not set, Tx can sit forever.
> 
>  This is solved in this patch with write flushes using the Flush Partial
>  Descriptors (FPD) bit in TIDV and RDTR.
> 
>  This improves bus utilization and removes partial writes on e1000e,
>  particularly from 82571 parts in S5500 chipset based machines.
> 
>  Only ES2LAN and 82571/2 parts are included in this optimization, to reduce
>  testing load.
> 
> We have modified the instrumented kernel to include the following patch
> disabling writeback batching feature to narrow down the problem:
> 
> --- debug/drivers/net/e1000e/82571.c.orig      2011-10-11 14:00:44.000000000
> -0300
> +++ debug/drivers/net/e1000e/82571.c   2011-10-11 15:02:51.000000000 -0300
> @@ -2028,8 +2028,7 @@ struct e1000_info e1000_82571_info = {
>                                  | FLAG_RESET_OVERWRITES_LAA /* errata */
>                                  | FLAG_TARC_SPEED_MODE_BIT /* errata */
>                                  | FLAG_APME_CHECK_PORT_B,
> -      .flags2                 = FLAG2_DISABLE_ASPM_L1 /* errata 13 */
> -                                | FLAG2_DMA_BURST,
> +      .flags2                 = FLAG2_DISABLE_ASPM_L1, /* errata 13 */
>        .pba                    = 38,
>        .max_hw_frame_size      = DEFAULT_JUMBO,
> 
> 
> and the customer confirmed that the issue has disappeared since then.
> 
> Board info:
> 1e:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
> Controller (Copper) (rev 06)
> 
> 1e:00.0 0200: 8086:10bc (rev 06)
>         Subsystem: 103c:704b
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
> Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort-
> <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin B routed to IRQ 224
>         Region 0: Memory at fd4e0000 (32-bit, non-prefetchable) [size=128K]
>         Region 1: Memory at fd400000 (32-bit, non-prefetchable) [size=512K]
>         Region 2: I/O ports at 7000 [size=32]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                 Address: 00000000fee00000  Data: 4073
>         Capabilities: [e0] Express (v1) Endpoint, MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns,
> L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+
> Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 256 bytes, MaxReadReq 4096 bytes
>                 DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr+
> TransPend-
>                 LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0
> <4us, L1 <64us
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- 
> CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>         Kernel driver in use: e1000e
>         Kernel modules: e1000e
> 
> 
> I am checking the NIC dev specs again but so far I couldn't find
> a reason for this to happen yet.
> 
> Any ideas? It happens with 5.7 kernel (2.6.18-274.el5) and it seems
> to be happening with 6.1 as well, though I am waiting the instrumented
> kernel outputs to confirm.
> 
> This is related to 
> https://bugzilla.redhat.com/show_bug.cgi?id=746272
> 
> thanks in advance!
> fbl
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
> _______________________________________________
> E1000-devel mailing list
> E1000-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel&#174; Ethernet, visit 
> http://communities.intel.com/community/wired


------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure contains a
definitive record of customers, application performance, security
threats, fraudulent activity and more. Splunk takes this data and makes
sense of it. Business sense. IT sense. Common sense.
http://p.sf.net/sfu/splunk-d2d-oct
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit 
http://communities.intel.com/community/wired

Reply via email to