On Mon, Apr 15, 2019 at 11:22 AM Joseph Yasi <joe.y...@gmail.com> wrote:
>
> Hello,
> I reported a regression that happened after upgrading from 5.0.6 to 5.0.7:
> https://bugzilla.kernel.org/show_bug.cgi?id=203175
>
> This is fixed by reverting commit
> 7f0a3a436e88a71b96694c029f01a9a8eade3d5d e1000e: fix cyclic resets at link
> up with active tx. A few others have reported the same hang in bugzilla.
>
> Thanks,
> Joe Yasi
>
> dmesg of hang:
> [Sat Apr 6 00:12:10 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> [Sat Apr 6 00:12:10 2019] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [Sat Apr 6 00:12:12 2019] e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
>   TDH <0>
>   TDT <1>
>   next_to_use <1>
>   next_to_clean <0>
> buffer_info[next_to_clean]:
>   time_stamp <fffba7a7>
>   next_to_watch <0>
>   jiffies <fffbb140>
>   next_to_watch.status <0>
> MAC Status <40080080>
> PHY Status <7949>
> PHY 1000BASE-T Status <0>
> PHY Extended Status <3000>
> PCI Status <10>
> [Sat Apr 6 00:12:14 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>
> lspci -vv
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
>   Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
>   Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>   Latency: 0
>   Interrupt: pin A routed to IRQ 145
>   Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
>   Capabilities: [c8] Power Management version 3
>     Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>     Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
>   Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>     Address: 00000000fee00518 Data: 0000
>   Capabilities: [e0] PCI Advanced Features
>     AFCap: TP+ FLR+
>     AFCtrl: FLR-
>     AFStatus: TP-
>   Kernel driver in use: e1000e
>   Kernel modules: e1000
>

So the commit ID you reported doesn't match the commit ID in the kernel. I
believe the patch you are talking about is:

commit 0f9e980bf5ee1a97e2e401c846b2af989eb21c61
Author: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
Date:   Mon Jan 14 16:29:30 2019 +0300

    e1000e: fix cyclic resets at link up with active tx

    I'm seeing series of e1000e resets (sometimes endless) at system boot
    if something generates tx traffic at this time. In my case this is
    netconsole who sends message "e1000e 0000:02:00.0: Some CPU C-states
    have been disabled in order to enable jumbo frames" from e1000e itself.
    As result e1000_watchdog_task sees used tx buffer while carrier is off
    and start this reset cycle again.

    [ 17.794359] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [ 17.794714] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [ 22.936455] e1000e 0000:02:00.0 eth1: changing MTU from 1500 to 9000
    [ 23.033336] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 26.102364] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [ 27.174495] 8021q: 802.1Q VLAN Support v1.8
    [ 27.174513] 8021q: adding VLAN 0 to HW filter on device eth1
    [ 30.671724] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
    [ 30.898564] netpoll: netconsole: local port 6666
    [ 30.898566] netpoll: netconsole: local IPv6 address 2a02:6b8:0:80b:beae:c5ff:fe28:23f8
    [ 30.898567] netpoll: netconsole: interface 'eth1'
    [ 30.898568] netpoll: netconsole: remote port 6666
    [ 30.898568] netpoll: netconsole: remote IPv6 address 2a02:6b8:b000:605c:e61d:2dff:fe03:3790
    [ 30.898569] netpoll: netconsole: remote ethernet address b0:a8:6e:f4:ff:c0
    [ 30.917747] console [netcon0] enabled
    [ 30.917749] netconsole: network logging started
    [ 31.453353] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 34.185730] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 34.321840] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 34.465822] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 34.597423] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 34.745417] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 34.877356] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 35.005441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 35.157376] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 35.289362] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 35.417441] e1000e 0000:02:00.0: Some CPU C-states have been disabled in order to enable jumbo frames
    [ 37.790342] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

    This patch flushes tx buffers only once when carrier is off
    rather than at each watchdog iteration.

    Signed-off-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
    Tested-by: Aaron Brown <aaron.f.br...@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>

A quick review of the patch shows that it is fundamentally flawed, since
all it does is move the reset to the path where the link goes down.
That doesn't really resolve the original issue, since the complaint was
that the NIC was resetting because netconsole was queueing packets while
the link was down. Without the reset, the packets are just going to queue
up on the interface, and the first time the interface comes up it will
trigger a Tx hang message, as has been seen here.

I would recommend reverting the above patch and then addressing the
original problem. The question we should be asking is: why are we
enqueueing packets on a ring of the device when it doesn't have link?

A better fix might be to:

  1. remove the netif_start_queue in the e1000e_up call,
  2. replace it with netif_stop_queue in e1000e_open,
  3. place a call to netif_wake_queue just before the netif_carrier_on
     in the watchdog task, and
  4. add a call to netif_stop_queue just after the netif_carrier_off in
     the watchdog task.

That should prevent us from enqueueing packets on an interface with no
link, and would still allow us to flush packets out if they somehow got
past all that and were still enqueued to the Tx queue.
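
To make the intended queue gating concrete, here is a rough userspace
model of it. This is only a sketch and not a driver patch: struct
model_dev and the model_* functions are hypothetical stand-ins for the
net_device state and for the e1000e open, xmit, and watchdog paths, and
the comments only indicate where the real netif_* calls would be expected
to go.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. struct model_dev and the model_* helpers are
 * hypothetical; they are not the kernel's struct net_device or the real
 * e1000e functions.
 */
struct model_dev {
	bool carrier_ok;         /* stands in for netif_carrier_ok() */
	bool queue_stopped;      /* stands in for netif_queue_stopped() */
	unsigned int tx_pending; /* descriptors sitting on the Tx ring */
};

/* Open path: start with the queue stopped instead of started. */
static void model_open(struct model_dev *dev)
{
	dev->carrier_ok = false;
	dev->queue_stopped = true; /* where netif_stop_queue() would go */
	dev->tx_pending = 0;
}

/* Transmit path: returns true if the packet was placed on the ring. */
static bool model_xmit(struct model_dev *dev)
{
	if (dev->queue_stopped)
		return false; /* the stack holds the packet back */
	dev->tx_pending++;
	return true;
}

/* Watchdog: gate the queue on link transitions, flush leftovers. */
static void model_watchdog(struct model_dev *dev, bool link_up)
{
	if (link_up && !dev->carrier_ok) {
		dev->queue_stopped = false; /* netif_wake_queue() ... */
		dev->carrier_ok = true;     /* ... just before carrier on */
	} else if (!link_up && dev->carrier_ok) {
		dev->carrier_ok = false;    /* carrier off ... */
		dev->queue_stopped = true;  /* ... then netif_stop_queue() */
	}

	/* Anything still on the ring without link gets flushed. */
	if (!link_up && dev->tx_pending)
		dev->tx_pending = 0;

	/* In this model the queue runs only while carrier is up. */
	assert(dev->carrier_ok == !dev->queue_stopped);
}
```

In this model a transmit attempted before the first link-up is refused
rather than parked on the ring, so nothing stale is waiting there for the
hang check to trip over when the link does come up. A packet that was
queued just before a link drop is still flushed by the watchdog.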