On Mon, Mar 1, 2010 at 3:37 AM, Thrash Dude <thrash.d...@gmail.com> wrote:
> Seems to be a rather common issue with the e1000 module. I searched the
> archives back to 2005. Plenty of reports, no solutions.
There are some solutions, one of which is to try loading the driver with
TxDescriptorStep=4 TxDescriptors=1024

> The NIC does drop the link, PC does not hang. The link does become active
> again. Wouldn't be such an issue, although this PC is a file server for
> streaming audio and video files exported across nfs and cifs shares. Quite
> an annoying problem to get 55 minutes into a movie to have the link die.

For the recent occurrences, were you streaming over cifs or NFS? What
version of NFS? What client machine/OS did you test with? What streaming
software were you using to play the movie on the remote machine?

> NOTE: No, the link does not die with every movie. This seems to be
> completely random. I can flood the _server_ with 15 incoming connections
> continuously for 30 minutes and there's no problem. Or I can simply
> ping -c4 server and receive a Tx Unit Hang.

So maybe it's not actually related to traffic levels?

> Machine specs -
> Slackware x86_64 -current
> Pure Virgin Kernel 2.6.32.8 (have noticed issue with previous kernels)
> 7GB Ram
> AMD RS780
>
> Migrated same card to another machine to rule out the usual +4GB question,
> and another chipset to test.
> Intel P45, 2GB Ram - same issue

This is a promising development, because we may have something close to
that system here. Which slot did you plug the card into? What is the
barcode number on your adapter (format XXXXXX-XXX)?

The other (bad) option is that, since the problem follows the adapter, it
could be the adapter itself. Have you double-checked cooling of the NIC?
Do you have another identical NIC you can try? You can probably get
warranty support for the one you have, to get a replacement.

> VMware Player is currently installed. Issue presents itself when VMware is
> removed and/or VMware modules are not loaded.
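For reference, one way to apply the module options suggested above is a
modprobe.d fragment; this is a sketch (the file path is just a conventional
location, and the option names are taken from the suggestion above):

```
# /etc/modprobe.d/e1000.conf
# TxDescriptors sizes the Tx descriptor ring; TxDescriptorStep spaces out
# which descriptors the driver actually uses.
options e1000 TxDescriptors=1024 TxDescriptorStep=4
```

After saving, unload and reload the module (rmmod e1000; modprobe e1000)
for the options to take effect; modinfo -p e1000 lists the parameters the
loaded driver actually supports.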
> See below for modinfo, dmesg, IRQ's, lspci and some ethtool output
>
> Partial dmesg
> [43503.704198] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [43503.704199]   Tx Queue             <0>
> [43503.704200]   TDH                  <c7>
> [43503.704201]   TDT                  <da>
> [43503.704201]   next_to_use          <da>
> [43503.704202]   next_to_clean        <c8>
> [43503.704202] buffer_info[next_to_clean]
> [43503.704203]   time_stamp           <1029335c6>
> [43503.704203]   next_to_watch        <c9>
> [43503.704204]   jiffies              <102933c78>
> [43503.704205]   next_to_watch.status <0>
> [43505.704209] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [43505.704211]   Tx Queue             <0>
> [43505.704212]   TDH                  <c7>
> [43505.704212]   TDT                  <da>
> [43505.704213]   next_to_use          <da>
> [43505.704214]   next_to_clean        <c8>
> [43505.704214] buffer_info[next_to_clean]
> [43505.704215]   time_stamp           <1029335c6>
> [43505.704215]   next_to_watch        <c9>
> [43505.704216]   jiffies              <102934448>
> [43505.704216]   next_to_watch.status <0>
> [43507.704182] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [43507.704183]   Tx Queue             <0>
> [43507.704184]   TDH                  <c7>
> [43507.704185]   TDT                  <da>
> [43507.704185]   next_to_use          <da>
> [43507.704186]   next_to_clean        <c8>
> [43507.704186] buffer_info[next_to_clean]
> [43507.704187]   time_stamp           <1029335c6>
> [43507.704187]   next_to_watch        <c9>
> [43507.704188]   jiffies              <102934c18>
> [43507.704189]   next_to_watch.status <0>

Wow, that's a mess; please fix your mail client next time. What I do see
in the above appears to be a legitimate Tx hang. We have some debug code
you can run that can help us diagnose; would you be able to run that?
> modinfo e1000 | grep ^version
> version:        7.3.21-k5-NAPI
>
> ethtool eth0
> Settings for eth0:
>         Supported ports: [ TP ]
>         Supported link modes:   10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Full
>         Supports auto-negotiation: Yes
>         Advertised link modes:  10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Full
>         Advertised auto-negotiation: Yes
>         Speed: 1000Mb/s
>         Duplex: Full
>         Port: Twisted Pair
>         PHYAD: 0
>         Transceiver: internal
>         Auto-negotiation: on
>         Supports Wake-on: umbg
>         Wake-on: g
>         Current message level: 0x00000007 (7)
>         Link detected: yes
>
> ethtool -i eth0
> driver: e1000
> version: 7.3.21-k5-NAPI
> firmware-version: N/A
> bus-info: 0000:02:06.0
>
> ethtool -g eth0
> Ring parameters for eth0:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             256
> RX Mini:        0
> RX Jumbo:       0
> TX:             256
>
> ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: on
>
> lspci -vv
> 02:06.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)
>         Subsystem: Intel Corporation PRO/1000 GT Desktop Adapter
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-

This MAbort+ means that a transaction that the 82541 initiated was aborted
by the chipset for some reason.
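As an aside, the ring sizes above (256 in use out of a 4096 maximum) can
also be raised at runtime with ethtool instead of module parameters; a
sketch, assuming the interface is named eth0 (requires root, and changing
ring sizes briefly resets the link):

```shell
ethtool -g eth0          # show preset maximums and current ring sizes
ethtool -G eth0 tx 1024  # grow the Tx descriptor ring toward the 4096 max
ethtool -g eth0          # confirm the new setting took effect
```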
It is unusual to see this.

>         Latency: 64 (63750ns min), Cache Line Size: 4 bytes
>         Interrupt: pin A routed to IRQ 20
>         Region 0: Memory at fdec0000 (32-bit, non-prefetchable) [size=128K]
>         Region 1: Memory at fdea0000 (32-bit, non-prefetchable) [size=128K]
>         Region 2: I/O ports at df00 [size=64]
>         [virtual] Expansion ROM at fdf00000 [disabled] [size=128K]
>         Capabilities: [dc] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [e4] PCI-X non-bridge device
>                 Command: DPERE- ERO+ RBC=512 OST=1
>                 Status: Dev=00:00.0 64bit- 133MHz- SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz-
>         Kernel driver in use: e1000
>         Kernel modules: e1000
>
> cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3
>   0:        129          0        374          2   IO-APIC-edge      timer
>   1:          0          0          2          0   IO-APIC-edge      i8042
>   8:          0          0         17          0   IO-APIC-edge      rtc0
>   9:          0          0          0          0   IO-APIC-fasteoi   acpi
>  12:          0          0          4          0   IO-APIC-edge      i8042
>  14:          2         29    1067022       7488   IO-APIC-edge      pata_atiixp
>  15:          0          0          0          0   IO-APIC-edge      pata_atiixp
>  16:          2         80    5953112      20785   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb4, HDA Intel
>  17:          0          0          4          0   IO-APIC-fasteoi   ehci_hcd:usb1
>  18:          0         80    2809310      15755   IO-APIC-fasteoi   ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, HDA Intel, nvidia
>  19:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
>  20:         14        566   23527069      92044   IO-APIC-fasteoi   eth0
>  22:          7        215    2493511      18964   IO-APIC-fasteoi   ahci, firewire_ohci
> NMI:          0          0          0          0   Non-maskable interrupts
> LOC:   66407712   37087457   32777253   29849036   Local timer interrupts
> SPU:          0          0          0          0   Spurious interrupts
> PMI:          0          0          0          0   Performance monitoring interrupts
> PND:          0          0          0          0   Performance pending work
> RES:   22471181   27170295   14818184   14992262   Rescheduling interrupts
> CAL:     282347     218071     159212     205899   Function call interrupts
> TLB:     549650     571693     491287     469560   TLB shootdowns
> TRM:          0          0          0          0   Thermal event interrupts
> THR:          0          0          0          0   Threshold APIC
> interrupts
> MCE:          0          0          0          0   Machine check exceptions
> MCP:        376        376        376        376   Machine check polls
> ERR:          0
> MIS:          0
>
> ifconfig eth0
> eth0      Link encap:Ethernet  HWaddr 00:0e:0c:c2:82:04
>           inet addr:192.168.1.109  Bcast:192.168.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::20e:cff:fec2:8204/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:21429857 errors:58 dropped:446 overruns:0 frame:29
>           TX packets:16592984

Something is definitely strange here; errors, dropped, and frame being
non-zero is not normal either.

>           errors:0 dropped:0 overruns:0 carrier:0 collisions:0
>           txqueuelen:1000
>           RX bytes:26437196186 (24.6 GiB)  TX bytes:10820406728 (10.0 GiB)
>
> If I upgrade to version e1000-8.0.19, the Tx Unit Hang appears immediately
> after the e1000 module is loaded.

But does the part work at that point, or is it completely dead?

> Using ethtool to turn off rx, tx, sg, tso, and gso, things appear to work
> better. But in that case, a $5 r8169 performs just as well.
>
> Full dmesg in next post.

Still waiting for the next post... Please also include the output of
ethtool -S eth0 after the next hang you get.

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
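P.S. For anyone following along: the offload experiment and the statistics
request in the reply above look something like the following sketch
(assuming the interface is eth0 and root privileges; toggling the offloads
one at a time, rather than all at once, can help narrow down which one is
involved in the hang):

```shell
ethtool -K eth0 tso off   # TCP segmentation offload
ethtool -K eth0 gso off   # generic segmentation offload
ethtool -K eth0 sg  off   # scatter-gather
ethtool -K eth0 tx  off   # Tx checksum offload
ethtool -K eth0 rx  off   # Rx checksum offload

# After the next Tx Unit Hang, capture the NIC statistics to post:
ethtool -S eth0 | tee ethtool-S-after-hang.txt
```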