Re: [E1000-devel] [e1000e REGRESSION BISECTED] Detected Hardware Unit Hang with 5.0.7

Konstantin Khlebnikov Wed, 17 Apr 2019 10:57:37 -0700



On 16.04.2019 20:12, Alexander Duyck wrote:

On Mon, Apr 15, 2019 at 11:22 AM Joseph Yasi <joe.y...@gmail.com> wrote:


Hello,
I reported a regression that happened after upgrading from 5.0.6 to 5.0.7:
https://bugzilla.kernel.org/show_bug.cgi?id=203175

This is fixed by reverting commit
7f0a3a436e88a71b96694c029f01a9a8eade3d5d ("e1000e: fix cyclic resets at
link up with active tx"). A few others have reported the same hang in
bugzilla.

Thanks,
Joe Yasi

dmesg of hang:
[Sat Apr  6 00:12:10 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Sat Apr  6 00:12:10 2019] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Sat Apr  6 00:12:12 2019] e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
                              TDH                  <0>
                              TDT                  <1>
                              next_to_use          <1>
                              next_to_clean        <0>
                            buffer_info[next_to_clean]:
                              time_stamp           <fffba7a7>
                              next_to_watch        <0>
                              jiffies              <fffbb140>
                              next_to_watch.status <0>
                            MAC Status             <40080080>
                            PHY Status             <7949>
                            PHY 1000BASE-T Status  <0>
                            PHY Extended Status    <3000>
                            PCI Status             <10>
[Sat Apr  6 00:12:14 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
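
As a reading aid for the dump above, the following Python sketch works out
two numbers from it: how many descriptors sit between TDH and TDT, and how
old the stuck buffer is in jiffies. The register values are copied from the
dump. The ring size of 256, the 32-bit counter width, and the helper names
are assumptions made for illustration only. Converting jiffies to seconds
needs the kernel's CONFIG_HZ, which the report does not state.

```python
# Values copied from the "Detected Hardware Unit Hang" dump.
TDH = 0x0
TDT = 0x1
TIME_STAMP = 0xFFFBA7A7
JIFFIES = 0xFFFBB140

# Assumption: a 256-entry tx ring. The report does not give the ring size.
ASSUMED_RING_SIZE = 256


def pending_descriptors(tdh, tdt, ring_size):
    """Descriptors handed to the hardware (tail) but not yet fetched (head)."""
    return (tdt - tdh) % ring_size


def jiffies_delta(now, then, bits=32):
    """Elapsed jiffies, tolerant of the counter wrapping around."""
    return (now - then) & ((1 << bits) - 1)


if __name__ == "__main__":
    print("pending descriptors:", pending_descriptors(TDH, TDT, ASSUMED_RING_SIZE))
    print("buffer age in jiffies:", jiffies_delta(JIFFIES, TIME_STAMP))
```

With the values from the dump this gives one pending descriptor whose buffer
is 0x999 (2457) jiffies old.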

lspci -vv
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
         Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0
         Interrupt: pin A routed to IRQ 145
         Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
         Capabilities: [c8] Power Management version 3
                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                 Address: 00000000fee00518  Data: 0000
         Capabilities: [e0] PCI Advanced Features
                 AFCap: TP+ FLR+
                 AFCtrl: FLR-
                 AFStatus: TP-
         Kernel driver in use: e1000e
         Kernel modules: e1000


So the commit ID you reported doesn't match up to the value in the
kernel. I believe the patch you are talking about is:
commit 0f9e980bf5ee1a97e2e401c846b2af989eb21c61
Author: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
Date:   Mon Jan 14 16:29:30 2019 +0300

     e1000e: fix cyclic resets at link up with active tx

     I'm seeing series of e1000e resets (sometimes endless) at system boot
     if something generates tx traffic at this time. In my case this is
     netconsole who sends message "e1000e 0000:02:00.0: Some CPU C-states
     have been disabled in order to enable jumbo frames" from e1000e itself.
     As result e1000_watchdog_task sees used tx buffer while carrier is off
     and start this reset cycle again.

     [   17.794359] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
     [   17.794714] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
     [   22.936455] e1000e 0000:02:00.0 eth1: changing MTU from 1500 to 9000
     [   23.033336] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   26.102364] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
     [   27.174495] 8021q: 802.1Q VLAN Support v1.8
     [   27.174513] 8021q: adding VLAN 0 to HW filter on device eth1
     [   30.671724] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
     [   30.898564] netpoll: netconsole: local port 6666
     [   30.898566] netpoll: netconsole: local IPv6 address 2a02:6b8:0:80b:beae:c5ff:fe28:23f8
     [   30.898567] netpoll: netconsole: interface 'eth1'
     [   30.898568] netpoll: netconsole: remote port 6666
     [   30.898568] netpoll: netconsole: remote IPv6 address 2a02:6b8:b000:605c:e61d:2dff:fe03:3790
     [   30.898569] netpoll: netconsole: remote ethernet address b0:a8:6e:f4:ff:c0
     [   30.917747] console [netcon0] enabled
     [   30.917749] netconsole: network logging started
     [   31.453353] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   34.185730] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   34.321840] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   34.465822] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   34.597423] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   34.745417] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   34.877356] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   35.005441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   35.157376] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   35.289362] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   35.417441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
     [   37.790342] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

     This patch flushes tx buffers only once when carrier is off
     rather than at each watchdog iteration.

     Signed-off-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
     Tested-by: Aaron Brown <aaron.f.br...@intel.com>
     Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>

A quick review of the patch shows that it is fundamentally flawed,
since all it does is move the reset to the path where the link goes
down. However, that doesn't really resolve the original issue, since
the complaint was that the NIC was resetting because netconsole was
queueing packets while the link was down. Without the reset, the
packets are just going to queue up on the interface, and the first time
the interface comes up it will trigger a Tx hang message, as has been
seen here.


Not exactly: the reset itself adds new packets to the tx queue, and
this triggers a new NIC reset at the next watchdog iteration.
The link state stays down because the uplink switch reacts with some
delay, and it looks like each reset restarts this delay.
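
The cycle described here can be modelled in a few lines of Python. This is a
toy model, not driver code: the function name, the single pending-packet
counter, and the assumption that every reset queues exactly one new packet
(for example a driver message forwarded by netconsole) are all made up for
illustration, and the switch's link delay is not modelled at all. It only
contrasts flushing at every watchdog iteration with flushing once per
link-down period, as the patch description puts it.

```python
def count_resets(watchdog_iterations, flush_every_iteration):
    """Count resets in a toy watchdog loop where the link never comes up.

    Hypothetical model: one packet is already queued when the loop starts,
    and every reset causes one more packet to be queued.
    """
    tx_pending = 1
    flushed_since_link_down = False
    resets = 0
    for _ in range(watchdog_iterations):
        if tx_pending == 0:
            continue  # nothing queued, nothing to flush
        if flush_every_iteration or not flushed_since_link_down:
            resets += 1
            flushed_since_link_down = True
            tx_pending = 0   # the reset flushes the ring ...
            tx_pending += 1  # ... and its own log message is queued again
    return resets


if __name__ == "__main__":
    print("flush at every iteration:", count_resets(10, True))
    print("flush once per link-down:", count_resets(10, False))
```

In this model the first variant resets on every iteration for as long as the
link stays down, while the second resets once and then leaves the queued
packet in place.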


I would recommend reverting the above patch and then addressing the
original problem. The question we should be asking is why are we
enqueueing packets on a ring of the device when it doesn't have link?

A better fix might be to remove the netif_start_queue in the e1000e_up
call, replace it with netif_stop_queue in e1000e_open, place a call to
netif_wake_queue just before the netif_carrier_on in the watchdog
task, and to add a call to netif_stop_queue just after the
netif_carrier_off in the watchdog task. That should prevent us from
enqueuing packets on an interface with no link, and would still allow
us to flush packets out if they somehow got by all that and were still
enqueued to the Tx queue.
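
To make the ordering in this suggestion concrete, here is a small Python
model of it. It is only a sketch under stated assumptions: ToyNic and its
methods are hypothetical stand-ins, the comments name the kernel helpers
mentioned above purely to show where each call would sit, and none of this
is the actual e1000e change.

```python
class ToyNic:
    """Hypothetical model of a NIC whose tx queue is gated on link state."""

    def __init__(self):
        self.queue_stopped = True
        self.carrier = False
        self.ring = []

    def open(self):
        # Suggested: keep the queue stopped at open (netif_stop_queue)
        # instead of starting it in the "up" path.
        self.queue_stopped = True

    def watchdog(self, link_up):
        if link_up and not self.carrier:
            self.queue_stopped = False  # netif_wake_queue, just before ...
            self.carrier = True         # ... netif_carrier_on
        elif not link_up and self.carrier:
            self.carrier = False        # netif_carrier_off, then ...
            self.queue_stopped = True   # ... netif_stop_queue

    def xmit(self, packet):
        """Return True if the packet was placed on the tx ring."""
        if self.queue_stopped:
            return False
        self.ring.append(packet)
        return True


if __name__ == "__main__":
    nic = ToyNic()
    nic.open()
    print("queued before link up:", nic.xmit("early message"))
    nic.watchdog(link_up=True)
    print("queued after link up:", nic.xmit("normal traffic"))
    nic.watchdog(link_up=False)
    print("queued after link down:", nic.xmit("late message"))
```

The point of the model is that nothing reaches the ring while the carrier is
off, so the watchdog has no stale tx buffers to react to when the link
finally comes up.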


Yep, this looks like the proper solution.


_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
