[E1000-devel] Detected Hardware Unit Hang

Zoltán Halassy Fri, 20 Mar 2015 02:30:42 -0700

I manage a Fujitsu Primergy TX100 S2 server, which has this integrated NIC:


00:19.0 Ethernet controller [0200]: Intel Corporation 82578DM Gigabit Network Connection [8086:10ef] (rev 05)
        Subsystem: Fujitsu Technology Solutions Device [1734:11a6]
        Kernel driver in use: e1000e

The kernel is 3.17.7, running as dom0 on Xen 4.5.0.

If the NIC is connected to a 100Mb/s link or forced to negotiate
100Mb/s (via "ethtool -s enp0s25 advertise 0x008"), it works properly.
The kernel dmesg shows this:

[    8.868197] e1000e: Intel(R) PRO/1000 Network Driver - 3.1.0.2-NAPI
[    8.868199] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[    8.868412] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    9.115322] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:19:99:a7:f8:51
[    9.115325] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[    9.115359] e1000e 0000:00:19.0 eth0: MAC: 9, PHY: 9, PBA No: 313130-031
[    9.131477] e1000e 0000:00:19.0 enp0s25: renamed from eth0
[   29.669862] e1000e: enp0s25 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx
[   29.669972] e1000e 0000:00:19.0 enp0s25: 10/100 speed: disabling TSO
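
The advertise mask 0x008 used above is a bit field of link modes. As a minimal sketch, the following decodes such a mask using the 10/100/1000BASE-T values listed in the ethtool man page; the helper name decode_advertise is made up for illustration and is not part of ethtool.

```python
# Link-mode bits accepted by "ethtool -s <dev> advertise <mask>",
# as listed in the ethtool man page (10/100/1000BASE-T modes only).
ADVERTISE_BITS = {
    0x001: "10baseT/Half",
    0x002: "10baseT/Full",
    0x004: "100baseT/Half",
    0x008: "100baseT/Full",
    0x010: "1000baseT/Half",
    0x020: "1000baseT/Full",
}


def decode_advertise(mask):
    """Return the names of the link modes selected by an advertise mask."""
    return [name for bit, name in sorted(ADVERTISE_BITS.items()) if mask & bit]


if __name__ == "__main__":
    # The mask from the post: only 100 Mb/s full duplex is advertised.
    print(decode_advertise(0x008))
    # A hypothetical mask that advertises 100 and 1000 Mb/s full duplex.
    print(decode_advertise(0x028))
```

With only bit 0x008 set, the link partner is offered 100baseT/Full and nothing else, which is why the link comes up at 100 Mbps in the log above.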

However, if it's connected to a 1Gb/s device, these messages appear first:

[   82.149105] e1000e: Intel(R) PRO/1000 Network Driver - 3.1.0.2-NAPI
[   82.149108] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[   82.149312] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[   82.396102] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:19:99:a7:f8:51
[   82.396105] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[   82.396139] e1000e 0000:00:19.0 eth0: MAC: 9, PHY: 9, PBA No: 313130-031
[   82.414029] e1000e 0000:00:19.0 enp0s25: renamed from eth0
[   93.410124] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx

Now, if I download something with high throughput (inbound traffic,
say, 120Mb/s), it works properly too. However, if I try to upload
something with high throughput (outbound traffic; the receiving end is
willing to accept around 25Mb/s), the connection hangs for a few
seconds and these messages appear in dmesg:

[  155.601937] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdbcbc>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  157.602036] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdc48c>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  159.601880] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdcc5c>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  161.601989] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdd42c>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  163.602096] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffddbfc>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  163.605665] ------------[ cut here ]------------
[  163.605677] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x22c/0x240()
[  163.605680] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
[  163.605682] Modules linked in: e1000e(O)
[  163.605690] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.17.7-hardened-r1-uther #3
[  163.605693] Hardware name: FUJITSU PRIMERGY TX100 S2             /D2779, BIOS 6.00 Rev. 1.07.2779.A1 04/29/2011
[  163.605695]  0000000000000009 ffffffff8183f2a9 ffff88016f203e60 ffffffff8106bb2d
[  163.605699]  0000000000000000 ffff88016f203eb0 0000000000000001 0000000000000000
[  163.605703]  0000000000000000 ffffffff8106bb97 ffffffff81ac7c18 0000000000000030
[  163.605707] Call Trace:
[  163.605710]  <IRQ>  [<ffffffff8183f2a9>] ? dump_stack+0x41/0x51
[  163.605728]  [<ffffffff8106bb2d>] ? warn_slowpath_common+0x6d/0x90
[  163.605730]  [<ffffffff8106bb97>] ? warn_slowpath_fmt+0x47/0x50
[  163.605733]  [<ffffffff81505ed2>] ? add_interrupt_randomness+0x32/0x1e0
[  163.605735]  [<ffffffff816d120c>] ? dev_watchdog+0x22c/0x240
[  163.605737]  [<ffffffff816d0fe0>] ? dev_graft_qdisc+0x70/0x70
[  163.605741]  [<ffffffff810ac232>] ? call_timer_fn.isra.36+0x12/0x70
[  163.605744]  [<ffffffff810ac440>] ? run_timer_softirq+0x1b0/0x240
[  163.605746]  [<ffffffff8106eb3b>] ? __do_softirq+0xdb/0x200
[  163.605748]  [<ffffffff8106ee3d>] ? irq_exit+0x4d/0x60
[  163.605752]  [<ffffffff814d5cef>] ? xen_evtchn_do_upcall+0x2f/0x40
[  163.605755]  [<ffffffff818478fe>] ? xen_do_hypervisor_callback+0x1e/0x30
[  163.605756]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[  163.605761]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[  163.605764]  [<ffffffff8100764c>] ? xen_safe_halt+0xc/0x20
[  163.605767]  [<ffffffff81014775>] ? default_idle+0x5/0x10
[  163.605770]  [<ffffffff810989e7>] ? cpu_startup_entry+0x217/0x260
[  163.605772]  [<ffffffff81bdcea5>] ? 0xffffffff81bdcea5
[  163.605774]  [<ffffffff81bdc8c8>] ? 0xffffffff81bdc8c8
[  163.605775]  [<ffffffff81be011e>] ? 0xffffffff81be011e
[  163.605777] ---[ end trace 7d81642d805c09bf ]---
[  163.605785] e1000e 0000:00:19.0 enp0s25: Reset adapter unexpectedly
[  166.432861] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx
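
To make the hang reports above easier to read, here is a rough sketch that parses the first report and derives two numbers: how many TX descriptors sit between the hardware head (TDH) and the software tail (TDT), and how old the oldest uncleaned buffer is. Two values are not stated anywhere in the log and are labeled in the code: RING_SIZE = 256 is an assumption (the TX ring size is never shown), and HZ = 1000 is inferred from the jiffies values, which advance by 0x7d0 (2000) between reports printed 2 seconds apart.

```python
import re

# First "Detected Hardware Unit Hang" report, copied from the log above.
DUMP = """\
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdbcbc>
"""

RING_SIZE = 256  # ASSUMPTION: TX ring size; it does not appear in the log
HZ = 1000        # INFERRED: jiffies grow by 2000 between reports 2 s apart

LINE_RE = re.compile(r"^[ \t]*([\w.]+)[ \t]+<([0-9a-f]+)>[ \t]*$", re.M)


def parse_dump(text):
    """Map each 'name <hex>' line of a hang report to an integer."""
    return {m.group(1): int(m.group(2), 16) for m in LINE_RE.finditer(text)}


def pending_descriptors(regs, ring_size=RING_SIZE):
    """Descriptors queued by the driver (TDT) that the NIC has not passed (TDH)."""
    return (regs["TDT"] - regs["TDH"]) % ring_size


def stall_seconds(regs, hz=HZ):
    """Age of the oldest uncleaned buffer, allowing for 32-bit jiffies wraparound."""
    return ((regs["jiffies"] - regs["time_stamp"]) & 0xFFFFFFFF) / hz


if __name__ == "__main__":
    regs = parse_dump(DUMP)
    print("pending descriptors:", pending_descriptors(regs))  # 26
    print("stalled for (s):", stall_seconds(regs))            # 2.04
```

In all five reports TDH stays at 0x86 and time_stamp stays at fffdb4c4 while jiffies keeps growing, so under these assumptions the same descriptor is stuck for the whole 8 seconds until the watchdog resets the adapter.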

This happens with the vanilla e1000e driver bundled with 3.17.7 and
with the 3.1.0.2 driver downloaded from intel.com, both with and
without CFLAGS_EXTRA=-DDISABLE_PCI_MSI. The same problem occurred with
the 3.10 and 3.8 kernels. I don't know if there was ever a working
driver, because the problem appeared only when we upgraded our switch
to a 1Gb/s one, and at that time we already had the 3.9 kernel. I was
hoping newer releases would fix this eventually, as I found some
similar problems on the net, but no luck yet.

I also let the NIC negotiate 1Gb/s and tried "ethtool -K enp0s25 gso
off gro off tso off". The kernel log then remains silent, but other
problems appear: reaching the server with ssh over this NIC from the
outside shows a lot of latency (up to 5000ms), although when I ping the
server periodically (1s intervals), the response time gets better
(~1000ms). When the server downloads something, this problem does not
appear (download speed reaches 120Mb/s from the Internet, the cap of
the ISP).
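
For anyone trying to reproduce the offload test, the sketch below builds the full ethtool command line for switching offload features on or off. The -K option and the gso/gro/tso feature names come from the ethtool man page; the helper name offload_command is hypothetical, and the interface name enp0s25 is taken from the dmesg output.

```python
import shlex


def offload_command(dev, features, state="off"):
    """Build the argv for 'ethtool -K <dev> <feature> <state> ...'."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    argv = ["ethtool", "-K", dev]
    for feature in features:
        argv += [feature, state]
    return argv


if __name__ == "__main__":
    argv = offload_command("enp0s25", ["gso", "gro", "tso"])
    # Print the command instead of running it; applying it needs root
    # and the ethtool binary, e.g. subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Running "ethtool -k enp0s25" (lower-case k) afterwards lists the current offload settings, which is a quick way to confirm the change took effect.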

Should I attach my kernel config? Nothing fancy there. MSI support is
compiled into the kernel. Only the E1000E driver is enabled, as a
module; the other Intel modules are disabled.

