Re: [E1000-devel] [e1000e REGRESSION BISECTED] Detected Hardware Unit Hang with 5.0.7

Joseph Yasi Tue, 16 Apr 2019 10:19:33 -0700

On Tue, Apr 16, 2019, 1:12 PM Alexander Duyck <alexander.du...@gmail.com> wrote:

> On Mon, Apr 15, 2019 at 11:22 AM Joseph Yasi <joe.y...@gmail.com> wrote:
> >
> > Hello,
> > I reported a regression that happened after upgrading from 5.0.6 to 5.0.7:
> > https://bugzilla.kernel.org/show_bug.cgi?id=203175
> >
> > This is fixed by reverting commit
> > 7f0a3a436e88a71b96694c029f01a9a8eade3d5d ("e1000e: fix cyclic resets at
> > link up with active tx"). A few others have reported the same hang in bugzilla.
> >
> > Thanks,
> > Joe Yasi
> >
> > dmesg of hang:
> > [Sat Apr  6 00:12:10 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> > [Sat Apr  6 00:12:10 2019] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > [Sat Apr  6 00:12:12 2019] e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
> >   TDH                  <0>
> >   TDT                  <1>
> >   next_to_use          <1>
> >   next_to_clean        <0>
> > buffer_info[next_to_clean]:
> >   time_stamp           <fffba7a7>
> >   next_to_watch        <0>
> >   jiffies              <fffbb140>
> >   next_to_watch.status <0>
> > MAC Status             <40080080>
> > PHY Status             <7949>
> > PHY 1000BASE-T Status  <0>
> > PHY Extended Status    <3000>
> > PCI Status             <10>
> > [Sat Apr  6 00:12:14 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> >
> > lspci -vv
> > 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
> >         Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
> >         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >         Latency: 0
> >         Interrupt: pin A routed to IRQ 145
> >         Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
> >         Capabilities: [c8] Power Management version 3
> >                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> >                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> >         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >                 Address: 00000000fee00518  Data: 0000
> >         Capabilities: [e0] PCI Advanced Features
> >                 AFCap: TP+ FLR+
> >                 AFCtrl: FLR-
> >                 AFStatus: TP-
> >         Kernel driver in use: e1000e
> >         Kernel modules: e1000
> >
>
> So the commit ID you reported doesn't match up to the value in the
> kernel. I believe the patch you are talking about is:
> commit 0f9e980bf5ee1a97e2e401c846b2af989eb21c61
> Author: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
>

It matches the commit ID in the linux-5.0.y stable branch.
Yes, 0f9e980bf5ee1a97e2e401c846b2af989eb21c61 is the upstream commit ID.

> Date:   Mon Jan 14 16:29:30 2019 +0300
>
>     e1000e: fix cyclic resets at link up with active tx
>
>     I'm seeing series of e1000e resets (sometimes endless) at system boot
>     if something generates tx traffic at this time. In my case this is
>     netconsole who sends message "e1000e 0000:02:00.0: Some CPU C-states
>     have been disabled in order to enable jumbo frames" from e1000e itself.
>     As result e1000_watchdog_task sees used tx buffer while carrier is off
>     and start this reset cycle again.
>
>     [   17.794359] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>     [   17.794714] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
>     [   22.936455] e1000e 0000:02:00.0 eth1: changing MTU from 1500 to 9000
>     [   23.033336] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   26.102364] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>     [   27.174495] 8021q: 802.1Q VLAN Support v1.8
>     [   27.174513] 8021q: adding VLAN 0 to HW filter on device eth1
>     [   30.671724] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
>     [   30.898564] netpoll: netconsole: local port 6666
>     [   30.898566] netpoll: netconsole: local IPv6 address 2a02:6b8:0:80b:beae:c5ff:fe28:23f8
>     [   30.898567] netpoll: netconsole: interface 'eth1'
>     [   30.898568] netpoll: netconsole: remote port 6666
>     [   30.898568] netpoll: netconsole: remote IPv6 address 2a02:6b8:b000:605c:e61d:2dff:fe03:3790
>     [   30.898569] netpoll: netconsole: remote ethernet address b0:a8:6e:f4:ff:c0
>     [   30.917747] console [netcon0] enabled
>     [   30.917749] netconsole: network logging started
>     [   31.453353] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   34.185730] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   34.321840] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   34.465822] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   34.597423] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   34.745417] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   34.877356] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   35.005441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   35.157376] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   35.289362] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   35.417441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
>     [   37.790342] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>
>     This patch flushes tx buffers only once when carrier is off
>     rather than at each watchdog iteration.
>
>     Signed-off-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
>     Tested-by: Aaron Brown <aaron.f.br...@intel.com>
>     Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
>
> A quick review of the patch shows that it is fundamentally flawed,
> since all it does is move the reset to the path where the link goes
> down. That doesn't really resolve the original issue, since the
> complaint was that the NIC was resetting because netconsole was
> queueing packets while the link was down. Without the reset, the
> packets are just going to queue up on the interface, and the first
> time the interface comes up it will trigger a Tx hang message, as has
> been seen here.
>
> I would recommend reverting the above patch and then addressing the
> original problem. The question we should be asking is why are we
> enqueueing packets on a ring of the device when it doesn't have link?
>
> A better fix might be to remove the netif_start_queue in the e1000e_up
> call, replace it with netif_stop_queue in e1000e_open, place a call to
> netif_wake_queue just before the netif_carrier_on in the watchdog
> task, and add a call to netif_stop_queue just after the
> netif_carrier_off in the watchdog task. That should prevent us from
> enqueuing packets on an interface with no link, and would still allow
> us to flush packets out if they somehow got past all that and were
> still enqueued to the Tx queue.
>
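
For anyone following along, the queue gating proposed above can be modeled
in a few lines of standalone C. This is only a toy sketch, not driver code
and not a tested patch: struct toy_netdev and the toy_* functions are
made-up names for illustration, and the comments note which kernel helper
each step is meant to stand in for. The flush on link down is a simplified
placeholder for whatever flush/reset path the driver would actually use.

```c
#include <stdbool.h>

/* Toy stand-in for a network device; NOT the kernel's struct net_device. */
struct toy_netdev {
    bool carrier_ok;     /* what netif_carrier_ok() would report */
    bool queue_stopped;  /* what netif_queue_stopped() would report */
    int  tx_pending;     /* descriptors currently sitting on the Tx ring */
};

/* Open path: start with the queue stopped instead of started. */
static inline void toy_open(struct toy_netdev *d)
{
    d->carrier_ok = false;
    d->queue_stopped = true;   /* stands in for netif_stop_queue() */
    d->tx_pending = 0;
}

/* Transmit path: a stopped queue means nothing reaches the ring. */
static inline bool toy_xmit(struct toy_netdev *d)
{
    if (d->queue_stopped)
        return false;          /* the stack keeps the packet for later */
    d->tx_pending++;
    return true;
}

/* Watchdog: tie the queue state to link transitions. */
static inline void toy_watchdog(struct toy_netdev *d, bool link_up)
{
    if (link_up && !d->carrier_ok) {
        d->queue_stopped = false;  /* netif_wake_queue(), just before... */
        d->carrier_ok = true;      /* ...netif_carrier_on() */
    } else if (!link_up && d->carrier_ok) {
        d->carrier_ok = false;     /* netif_carrier_off(), then... */
        d->queue_stopped = true;   /* ...netif_stop_queue() */
    }

    /* Anything that still slipped onto the ring while the link is down
     * gets flushed (placeholder for the driver's flush/reset path). */
    if (!link_up && d->tx_pending > 0)
        d->tx_pending = 0;
}
```

With this model, a transmit attempted between toy_open() and the first
link-up watchdog pass is refused rather than parked on the ring, which is
the case the hang report above shows (TDT ahead of TDH right after link up).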

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
