Re: [E1000-devel] [e1000e REGRESSION BISECTED] Detected Hardware Unit Hang with 5.0.7

Alexander Duyck Tue, 16 Apr 2019 10:13:30 -0700

On Mon, Apr 15, 2019 at 11:22 AM Joseph Yasi <[email protected]> wrote:
>
> Hello,
> I reported a regression that happened after upgrading from 5.0.6 to 5.0.7:
> https://bugzilla.kernel.org/show_bug.cgi?id=203175
>
> This is fixed by reverting commit
> 7f0a3a436e88a71b96694c029f01a9a8eade3d5d e1000e: fix cyclic resets at link
> up with active tx. A few others have reported the same hang in bugzilla.
>
> Thanks,
> Joe Yasi
>
> dmesg of hang:
> [Sat Apr  6 00:12:10 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full
> Duplex, Flow Control: Rx/Tx
> [Sat Apr  6 00:12:10 2019] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link
> becomes ready
> [Sat Apr  6 00:12:12 2019] e1000e 0000:00:1f.6 eth0: Detected Hardware Unit
> Hang:
>
>                              TDH                  <0>
>
>                              TDT                  <1>
>
>                              next_to_use          <1>
>
>                              next_to_clean        <0>
>
>                            buffer_info[next_to_clean]:
>
>                              time_stamp           <fffba7a7>
>
>                              next_to_watch        <0>
>                              jiffies              <fffbb140>
>                              next_to_watch.status <0>
>                            MAC Status             <40080080>
>                            PHY Status             <7949>
>                            PHY 1000BASE-T Status  <0>
>                            PHY Extended Status    <3000>
>                            PCI Status             <10>
> [Sat Apr  6 00:12:14 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full
> Duplex, Flow Control: Rx/Tx
>
> lspci -vv
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2)
> I219-V
>         Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 145
>         Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
>         Capabilities: [c8] Power Management version 3
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                 Address: 00000000fee00518  Data: 0000
>         Capabilities: [e0] PCI Advanced Features
>                 AFCap: TP+ FLR+
>                 AFCtrl: FLR-
>                 AFStatus: TP-
>         Kernel driver in use: e1000e
>         Kernel modules: e1000
>


So the commit ID you reported doesn't match up to the value in the
kernel. I believe the patch you are talking about is:
commit 0f9e980bf5ee1a97e2e401c846b2af989eb21c61
Author: Konstantin Khlebnikov <[email protected]>
Date:   Mon Jan 14 16:29:30 2019 +0300

    e1000e: fix cyclic resets at link up with active tx

    I'm seeing series of e1000e resets (sometimes endless) at system boot
    if something generates tx traffic at this time. In my case this is
    netconsole who sends message "e1000e 0000:02:00.0: Some CPU C-states
    have been disabled in order to enable jumbo frames" from e1000e itself.
    As result e1000_watchdog_task sees used tx buffer while carrier is off
    and start this reset cycle again.

    [   17.794359] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: None
    [   17.794714] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [   22.936455] e1000e 0000:02:00.0 eth1: changing MTU from 1500 to 9000
    [   23.033336] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   26.102364] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: None
    [   27.174495] 8021q: 802.1Q VLAN Support v1.8
    [   27.174513] 8021q: adding VLAN 0 to HW filter on device eth1
    [   30.671724] cgroup: cgroup: disabling cgroup2 socket matching
due to net_prio or net_cls activation
    [   30.898564] netpoll: netconsole: local port 6666
    [   30.898566] netpoll: netconsole: local IPv6 address
2a02:6b8:0:80b:beae:c5ff:fe28:23f8
    [   30.898567] netpoll: netconsole: interface 'eth1'
    [   30.898568] netpoll: netconsole: remote port 6666
    [   30.898568] netpoll: netconsole: remote IPv6 address
2a02:6b8:b000:605c:e61d:2dff:fe03:3790
    [   30.898569] netpoll: netconsole: remote ethernet address
b0:a8:6e:f4:ff:c0
    [   30.917747] console [netcon0] enabled
    [   30.917749] netconsole: network logging started
    [   31.453353] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   34.185730] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   34.321840] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   34.465822] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   34.597423] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   34.745417] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   34.877356] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   35.005441] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   35.157376] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   35.289362] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   35.417441] e1000e 0000:02:00.0: Some CPU C-states have been
disabled in order to enable jumbo frames
    [   37.790342] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: None

    This patch flushes tx buffers only once when carrier is off
    rather than at each watchdog iteration.

    Signed-off-by: Konstantin Khlebnikov <[email protected]>
    Tested-by: Aaron Brown <[email protected]>
    Signed-off-by: Jeff Kirsher <[email protected]>

A quick review of the patch shows that it is fundamentally flawed,
since all it does is move the reset to the path where the link goes
down. That doesn't really resolve the original issue, though, since
the complaint was that the NIC was resetting because netconsole was
queueing packets while the link was down. Without the reset, the
packets just queue up on the interface, and the first time the
interface comes up it triggers a Tx hang message, as has been seen
here.

I would recommend reverting the above patch and then addressing the
original problem. The question we should be asking is why are we
enqueueing packets on a ring of the device when it doesn't have link?

A better fix might be to remove the netif_start_queue in the e1000e_up
call, replace it with netif_stop_queue in e1000e_open, place a call to
netif_wake_queue just before the netif_carrier_on in the watchdog
task, and to add a call to netif_stop_queue just after the
netif_carrier_off in the watchdog task. That should prevent us from
enqueuing packets on an interface with no link, and would still allow
us to flush packets out if they somehow got past all that and were
still enqueued on the Tx queue.
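
To make the intended ordering concrete, here is a rough userspace model
of the idea. It is not driver code and has not been tried against
e1000e: struct model_netdev and the model_* helpers are hypothetical
stand-ins, and the comments only name the kernel calls (netif_stop_queue,
netif_wake_queue, netif_carrier_on, netif_carrier_off) that each step is
meant to correspond to.

```c
/* Userspace model only; every name here is a hypothetical stand-in. */
#include <assert.h>
#include <stdbool.h>

struct model_netdev {
	bool carrier;            /* models the netif_carrier_ok() state */
	bool queue_stopped;      /* models the netif_queue_stopped() state */
	unsigned int tx_pending; /* packets sitting on the Tx ring */
};

/* Open: leave the queue stopped instead of starting it in the up path. */
static void model_open(struct model_netdev *dev)
{
	assert(dev);
	dev->carrier = false;
	dev->queue_stopped = true; /* netif_stop_queue() */
	dev->tx_pending = 0;
}

/* Transmit: returns true only if the packet reached the Tx ring. */
static bool model_xmit(struct model_netdev *dev)
{
	assert(dev);
	if (dev->queue_stopped)
		return false; /* the stack holds the packet back */
	dev->tx_pending++;
	return true;
}

/* Watchdog: the queue state follows the link state. */
static void model_watchdog(struct model_netdev *dev, bool link_up)
{
	assert(dev);
	if (link_up && !dev->carrier) {
		dev->queue_stopped = false; /* netif_wake_queue() ... */
		dev->carrier = true;        /* ... just before netif_carrier_on() */
	} else if (!link_up && dev->carrier) {
		dev->carrier = false;       /* netif_carrier_off() ... */
		dev->queue_stopped = true;  /* ... then netif_stop_queue() */
	}

	/* Anything that still slipped onto the ring without link gets flushed. */
	if (!dev->carrier && dev->tx_pending)
		dev->tx_pending = 0;
}
```

In this model a transmit attempted between open and the first link-up
never reaches the ring, so there is nothing left on the ring to be
reported as hung when the link comes up.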


