On Tue, Apr 16, 2019, 1:12 PM Alexander Duyck <alexander.du...@gmail.com> wrote:
> On Mon, Apr 15, 2019 at 11:22 AM Joseph Yasi <joe.y...@gmail.com> wrote:
> >
> > Hello,
> >
> > I reported a regression that happened after upgrading from 5.0.6 to 5.0.7:
> > https://bugzilla.kernel.org/show_bug.cgi?id=203175
> >
> > This is fixed by reverting commit
> > 7f0a3a436e88a71b96694c029f01a9a8eade3d5d e1000e: fix cyclic resets at link
> > up with active tx. A few others have reported the same hang in bugzilla.
> >
> > Thanks,
> > Joe Yasi
> >
> > dmesg of hang:
> > [Sat Apr 6 00:12:10 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> > [Sat Apr 6 00:12:10 2019] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > [Sat Apr 6 00:12:12 2019] e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
> >   TDH                  <0>
> >   TDT                  <1>
> >   next_to_use          <1>
> >   next_to_clean        <0>
> > buffer_info[next_to_clean]:
> >   time_stamp           <fffba7a7>
> >   next_to_watch        <0>
> >   jiffies              <fffbb140>
> >   next_to_watch.status <0>
> > MAC Status             <40080080>
> > PHY Status             <7949>
> > PHY 1000BASE-T Status  <0>
> > PHY Extended Status    <3000>
> > PCI Status             <10>
> > [Sat Apr 6 00:12:14 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> >
> > lspci -vv
> > 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
> >         Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
> >         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >         Latency: 0
> >         Interrupt: pin A routed to IRQ 145
> >         Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
> >         Capabilities: [c8] Power Management version 3
> >                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> >                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> >         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >                 Address: 00000000fee00518 Data: 0000
> >         Capabilities: [e0] PCI Advanced Features
> >                 AFCap: TP+ FLR+
> >                 AFCtrl: FLR-
> >                 AFStatus: TP-
> >         Kernel driver in use: e1000e
> >         Kernel modules: e1000
> >
>
> So the commit ID you reported doesn't match up to the value in the
> kernel. I believe the patch you are talking about is:
> commit 0f9e980bf5ee1a97e2e401c846b2af989eb21c61
> Author: Konstantin Khlebnikov <khlebni...@yandex-team.ru>

It matches the commit ID in the linux-5.0.y stable branch. Yes,
0f9e980bf5ee1a97e2e401c846b2af989eb21c61 is the upstream commit ID.

> Date: Mon Jan 14 16:29:30 2019 +0300
>
> e1000e: fix cyclic resets at link up with active tx
>
> I'm seeing series of e1000e resets (sometimes endless) at system boot
> if something generates tx traffic at this time. In my case this is
> netconsole who sends message "e1000e 0000:02:00.0: Some CPU C-states
> have been disabled in order to enable jumbo frames" from e1000e itself.
> As result e1000_watchdog_task sees used tx buffer while carrier is off
> and start this reset cycle again.
>
> [ 17.794359] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> [ 17.794714] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
> [ 22.936455] e1000e 0000:02:00.0 eth1: changing MTU from 1500 to 9000
> [ 23.033336] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 26.102364] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> [ 27.174495] 8021q: 802.1Q VLAN Support v1.8
> [ 27.174513] 8021q: adding VLAN 0 to HW filter on device eth1
> [ 30.671724] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
> [ 30.898564] netpoll: netconsole: local port 6666
> [ 30.898566] netpoll: netconsole: local IPv6 address 2a02:6b8:0:80b:beae:c5ff:fe28:23f8
> [ 30.898567] netpoll: netconsole: interface 'eth1'
> [ 30.898568] netpoll: netconsole: remote port 6666
> [ 30.898568] netpoll: netconsole: remote IPv6 address 2a02:6b8:b000:605c:e61d:2dff:fe03:3790
> [ 30.898569] netpoll: netconsole: remote ethernet address b0:a8:6e:f4:ff:c0
> [ 30.917747] console [netcon0] enabled
> [ 30.917749] netconsole: network logging started
> [ 31.453353] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 34.185730] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 34.321840] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 34.465822] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 34.597423] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 34.745417] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 34.877356] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 35.005441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 35.157376] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 35.289362] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 35.417441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
> [ 37.790342] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>
> This patch flushes tx buffers only once when carrier is off
> rather than at each watchdog iteration.
>
> Signed-off-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
> Tested-by: Aaron Brown <aaron.f.br...@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
>
> A quick review of the patch shows that it is fundamentally flawed
> since all it is doing is moving reset to the path where the link goes
> down. However that doesn't even really resolve the original issue
> since the complaint was that the NIC was resetting because netconsole
> was queueing packets while the link was down. Without the reset the
> packets are just going to queue up on the interface and the first time
> the interface comes up it will trigger a Tx hang message as has been
> seen here.
>
> I would recommend reverting the above patch and then addressing the
> original problem. The question we should be asking is why are we
> enqueueing packets on a ring of the device when it doesn't have link?
>
> A better fix might be to remove the netif_start_queue in the e1000e_up
> call, replace it with netif_stop_queue in e1000e_open, place a call to
> netif_wake_queue just before the netif_carrier_on in the watchdog
> task, and to add a call to netif_stop_queue just after the
> netif_carrier_off in the watchdog task. That should prevent us from
> enqueuing packets on an interface with no link, and would still allow
> us to flush packets out if they somehow got by all that and were still
> enqueued to the Tx queue.
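
The queue gating suggested above can be sketched as a small user-space model. This is not the e1000e source and it does not call any kernel API: the struct and every model_* function are hypothetical stand-ins for netif_stop_queue, netif_wake_queue, netif_carrier_on and netif_carrier_off, and it assumes the stack never hands a packet to a stopped queue.

```c
#include <stdbool.h>

/* User-space model only; all names here are made up for illustration. */
struct model_netdev {
    bool carrier_ok;      /* stands in for the carrier state */
    bool queue_stopped;   /* stands in for the Tx queue state */
    unsigned tx_pending;  /* packets sitting on the Tx ring */
};

void model_stop_queue(struct model_netdev *dev) { dev->queue_stopped = true; }
void model_wake_queue(struct model_netdev *dev) { dev->queue_stopped = false; }

/* Proposed open path: keep the queue stopped until link is confirmed. */
void model_open(struct model_netdev *dev)
{
    dev->carrier_ok = false;
    dev->tx_pending = 0;
    model_stop_queue(dev);
}

/* Assumed stack behaviour: refuse to enqueue on a stopped queue. */
bool model_xmit(struct model_netdev *dev)
{
    if (dev->queue_stopped)
        return false;
    dev->tx_pending++;
    return true;
}

/* Proposed watchdog ordering: wake just before carrier on,
 * stop just after carrier off. */
void model_watchdog(struct model_netdev *dev, bool link_up)
{
    if (link_up && !dev->carrier_ok) {
        model_wake_queue(dev);
        dev->carrier_ok = true;
    } else if (!link_up && dev->carrier_ok) {
        dev->carrier_ok = false;
        model_stop_queue(dev);
    }

    /* Anything that slipped through while link is down is still flushed. */
    if (!dev->carrier_ok && dev->tx_pending)
        dev->tx_pending = 0;
}
```

In this model a transmit attempt right after model_open() is rejected, so nothing lands on the ring before the first link-up, which is the situation that produced the Tx hang report. Whether the real driver behaves the same way depends on details the model leaves out, such as netpoll paths and multiple queues.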