Re: e1000e driver Network Card Detected Hardware Unit Hang

2024-04-16 Thread Sirius
In days of yore (Tue, 16 Apr 2024), Jamie thus quoth: 
> Look this is a kernel bug and Debian needs to
> fix this! Don't give me any of this crap about upstream
> this is a bug with the Debian Kernel!

Pay attention, because I am now in Support Mode as a former Principal
Technical Account Manager for Red Hat.


No, this is not necessarily a kernel bug. It may well be a hardware bug, and it
is plausible that it cannot be fully solved with a driver work-around.

You are hitting a problem and you want someone else to fix it for you. The
answer may simply be that you need to replace the NIC with something else.

FWIW, I have these Intel NICs in my two NUCs and they are functioning
fine with Debian 12.5 and the latest updates.

$ lspci -v -s 00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection I219-V (rev 21)
DeviceName:  LAN
Subsystem: Intel Corporation Ethernet Connection I219-V
Flags: bus master, fast devsel, latency 0, IRQ 123, IOMMU group 7
Memory at df10 (32-bit, non-prefetchable) [size=128K]
Capabilities: 
Kernel driver in use: e1000e
Kernel modules: e1000e

The revision of the NIC may determine whether you have *hardware* problems
or not.

> This needs to be fixed!

Quick answer: replace the NIC. And do some groundwork first to determine
whether the NIC you replace it with has known issues of its own.

> I have already tried disabling the offloads and it does
> not work.

The specific offloads seemed to be the CRC related ones.

# ethtool -k eno1
Features for eno1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]

Note: when you disable these, throughput can drop sharply.
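
If you want to experiment with that, something along these lines should
toggle them (interface name eno1 is from my box - substitute yours, and
re-check with ethtool -k afterwards, since features marked [fixed] cannot
be changed):

# ethtool -K eno1 rx off tx off         # disable RX/TX checksum offload
# ethtool -k eno1 | grep checksumming   # verify what actually changed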

The other suggested setting was to increase the TX ring buffer size.

# ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX:             4096
RX Mini:        n/a
RX Jumbo:       n/a
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        n/a
RX Jumbo:       n/a
TX:             256
RX Buf Len:     n/a
CQE Size:       n/a
TX Push:        off
TCP data split: n/a

# ethtool -G eno1 tx 2048 rx 2048
# ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX:             4096
RX Mini:        n/a
RX Jumbo:       n/a
TX:             4096
Current hardware settings:
RX:             2048
RX Mini:        n/a
RX Jumbo:       n/a
TX:             2048
RX Buf Len:     n/a
CQE Size:       n/a
TX Push:        off
TCP data split: n/a

The reason the ring buffers matter is that the kernel can construct packets
in bursts faster than the NIC can transmit them, so the kernel queues packets
in the ring buffer and the NIC asynchronously picks them off the buffer and
sends them across the wire. If the ring buffers are set too small, they
overflow and you will see drops and overflow errors in the statistics.
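
Note that ethtool -G settings do not survive a reboot. One way to make them
persistent (a sketch, assuming the interface is configured through ifupdown
in /etc/network/interfaces - adjust names and sizes to match your setup) is
a post-up hook on the existing iface stanza:

    post-up /usr/sbin/ethtool -G eno1 rx 2048 tx 2048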

# ethtool -S eno1
NIC statistics:
 rx_packets: 24463
 tx_packets: 6358
 rx_bytes: 3093199
 tx_bytes: 669733
 rx_broadcast: 8044
 tx_broadcast: 9
 rx_multicast: 10434
 tx_multicast: 2510
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 0  <-- if buffers are set too small, this increases
 multicast: 10434
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_abort_late_coll: 0
 tx_deferred_ok: 0
 tx_single_coll_ok: 0
 tx_multi_coll_ok: 0
 tx_timeout_count: 0
 tx_restart_queue: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_align_errors: 0
 tx_tcp_seg_good: 0
 tx_tcp_seg_failed: 0
 rx_flow_control_xon: 9
 rx_flow_control_xoff: 9
 tx_flow_control_xon: 0
 tx_flow_control_xoff: 0
 rx_csum_offload_good: 8539  <-- if you have issues with checksum
 rx_csum_offload_errors: 0   <-- offload, check these two counters
 rx_header_split: 0
 alloc_rx_buff_failed: 0
 tx_smbus: 0
 rx_smbus: 0
 dropped_smbus: 0
 rx_dma_failed: 0
 tx_dma_failed: 0
 rx_hwtstamp_cleared: 0
 uncorr_ecc_errors: 0
 corr_ecc_errors: 0
 tx_hwtstamp_timeouts: 0
 tx_hwtstamp_skipped: 0
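
When reproducing the hang it can be useful to watch the more interesting
counters live and see which of them move when the hang hits (interface name
and counter selection here are only illustrative):

# watch -n 1 "ethtool -S eno1 | grep -E 'dropped|timeout|restart_queue|csum_offload|fifo'"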

> It isn't the cable either I have tried different cables it
> still happens! This is an issue with the Kernel module for
> the e1000e NIC card.

Excellent data-point: you have ruled out a faulty cable. But your assumption
that it is the kernel module that is broken is still a faulty one.

Demonstrably, I am running the same type of NIC (albeit a different revision)
with the same driver and I do not observe any issues. Thus, applying Occam's
Razor, it follows that scrutinising your particular NIC and its hardware
revision is the more productive place to start.

Re: e1000e driver Network Card Detected Hardware Unit Hang

2024-04-16 Thread tomas
On Tue, Apr 16, 2024 at 09:05:29AM -0400, Stefan Monnier wrote:
> > It has been known to happen that drivers implement workarounds for issues
> > in the hardware itself, so that hardware bugs do not get tripped (or are
> > tripped less often).
> 
> 
> 
> You make it sound like it's a rare occurrence, but it's actually
> quite common.  Most of it is discrete so you'll rarely be exposed to it,
> but `grep bugs /proc/cpuinfo` is one of the places where you can see it
> being somewhat documented.

One might argue that a driver's whole raison d'être /is/ to work around
hardware bugs. But then, perhaps I'm a cynic ;-)

Cheers
-- 
t




Re: e1000e driver Network Card Detected Hardware Unit Hang

2024-04-16 Thread Stefan Monnier
> It has been known to happen that drivers implement workarounds for issues
> in the hardware itself, so that hardware bugs do not get tripped (or are
> tripped less often).



You make it sound like it's a rare occurrence, but it's actually
quite common.  Most of it is discrete so you'll rarely be exposed to it,
but `grep bugs /proc/cpuinfo` is one of the places where you can see it
being somewhat documented.


Stefan



Re: e1000e driver Network Card Detected Hardware Unit Hang

2024-04-15 Thread Sirius
In days of yore (Tue, 16 Apr 2024), Sirius thus quoth: 
> In days of yore (Mon, 15 Apr 2024), Jamie thus quoth: 
> > So  there is a very nasty bug in the e1000e network card
> > driver.

Doing some reading turned up a Proxmox thread about the issues with these
Intel NICs.

https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-10

May be worth scanning that thread and applying some of their solutions to
this problem.

-- 
Kind regards,

/S



Re: e1000e driver Network Card Detected Hardware Unit Hang

2024-04-15 Thread Sirius
In days of yore (Mon, 15 Apr 2024), Jamie thus quoth: 
> So  there is a very nasty bug in the e1000e network card
> driver.

https://www.intel.com/content/www/us/en/support/articles/05480/ethernet-products.html
notes that MSI interrupts may be problematic on some systems. Worth
digging into whether that is an issue on this system of yours.
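
A quick way to see whether the NIC is actually using MSI/MSI-X and which
interrupt line it sits on (the PCI address and interface name below are the
ones from this thread - adjust to match your system):

# lspci -vv -s 00:1f.6 | grep -i msi
# grep eth1 /proc/interrupts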

I am not sure Debian can resolve this problem in the driver, but the upstream
kernel folks might. I am unsure whether Intel still helps maintain this
driver, as it is quite old (I dealt with support issues on it some 15-16
years ago).

The Intel page states this is upstream kernel only at this point, so going
to SourceForge for their out-of-tree driver is no longer an option.
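
When comparing notes with other reports, it helps to state exactly which
driver and NIC firmware you are on (interface name illustrative):

$ ethtool -i eno1
$ modinfo -n e1000e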

> I am running Debian 12 Bookworm.
> 
> You will get the message "Detected Hardware Unit Hang" and then
> the network card just stops working.
[snip]

> This is a gigabit network card as I said it is a built in NIC I believe it
> is an Intel NIC.

It is an Intel NIC. Most of the NIC drivers beginning with an 'e' followed
by numbers are Intel as far as I know. These NICs were very common as
on-board NICs in OEM systems as Intel provided them in large volumes. They
are not the best, but they usually do their job.

[snip]
> This seems to happen when you are actually pushing a bit of traffic
> though it not a lot but just even a little bit.  It isn't network overload
> or anything I am barely doing anything really but it will do this.

If it is a hardware hang, it may be the NIC firmware getting its knickers
in a twist, and that is not something the kernel or the driver can do much
about.

> I have already tried  the following
> 
> ethtool -K eth1 tx off rx off
> ethtool -K eth1 tso off gso off
> ethtool -K eth1 gso off gro off tso off tx off rx off rxvlan off txvlan
> off sg off

All worthwhile things to try. You can also try adjusting the RX/TX ring
buffer sizes with ethtool -G (2048 is a reasonable value to try). It might
not help, but it is worth trying.

> I have disabled all power management in the bios as well including the one
> for ASPM
> 
> I added the following to grub
> 
> pcie_aspm=off e1000e.SmartPowerDownEnable=0
> 
> 
> This is in /etc/default/grub
> GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off
> e1000e.SmartPowerDownEnable=0"

Good thinking about power management. :)
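
After rebooting, it is worth sanity-checking that the parameters actually
made it onto the kernel command line, and looking at what the kernel says
about ASPM:

$ cat /proc/cmdline
$ sudo dmesg | grep -i aspm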

> Then I did an update-grub as well.
> 
> None of this has worked in fixing this problem.  I am still getting the
> same issue.

The best bet at this point would be to scout the Linux Kernel Mailing List
archives to see if anyone else has run into the same problem, and then to
check the kernel MAINTAINERS list for someone who works on the e1000e driver
and strike up a direct dialogue with them.
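
If you have a kernel source tree handy, the get_maintainer script will list
who to contact for this driver (run from the top of the tree):

$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/intel/e1000e/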

> Can you please fix this issue this is a really nasty problem with Debian
> 12 (Bookworm)
> 
> I am seeing this being reported back in Kernel 5.3.x but i am not seeing any
> reports for 6.1.x about this issue.
> 
> Debian Bug report logs - #945912
> Kernel 5.3 e100e Detected Hardware Unit Hang
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945912

If it has been reported before and is still present now, one of two things
is likely true.
 1) the problem was intermittent and could not be reliably reproduced in
order to debug and resolve
 2) the problem was related to the hardware itself, and there was no way
to fix it in either driver or firmware

It has been known to happen that drivers implement workarounds for issues
in the hardware itself, so that hardware bugs do not get tripped (or are
tripped less often).

> Please reply back and confirm that you got this email and that you are
> looking into this problem please.

To state the obvious, I am not a kernel maintainer for Debian and do not
speak on behalf of the Debian project.

I work for a Linux company you may have heard of and have done so for
almost eighteen years, a decade of which was in support. Fifteen years ago
I knew exactly who I would have gone to to look into this problem, but he
now works for Broadcom and has probably forgotten all about the
e1000/e1000e drivers.

The upstream driver maintainer would be the best bet, IMHO. If this driver is
community-supported only (i.e. if Intel no longer participates in its
maintenance), I would say that all bets are off.

With only one data point (your system and your NIC), it is not possible to
rule out that the NIC itself is bad. :-/


e1000e driver Network Card Detected Hardware Unit Hang

2024-04-15 Thread Jamie

So  there is a very nasty bug in the e1000e network card
driver.

I am running Debian 12 Bookworm.

You will get the message "Detected Hardware Unit Hang" and then
the network card just stops working.

This is a built in NIC  on the computer
The computer is a is a HP Prodesk 600 G4 MT

This is the mini tower version as denoted by the MT.


This log comes from my /var/log/syslog.


Apr 15 01:57:12 gateway vmunix: [ 7743.893557] e1000e :00:1f.6 eth1: 
Detected Hardware Unit Hang:

Apr 15 01:57:12 gateway vmunix: [ 7743.893557] TDH  
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] TDT  
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] next_to_use  
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] next_to_clean    
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] buffer_info[next_to_clean]:
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] time_stamp   
<1001c6345>

Apr 15 01:57:12 gateway vmunix: [ 7743.893557] next_to_watch    
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] jiffies  
<1001c6550>

Apr 15 01:57:12 gateway vmunix: [ 7743.893557] next_to_watch.status <0>
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] MAC Status 
<80083>

Apr 15 01:57:12 gateway vmunix: [ 7743.893557] PHY Status <796d>
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] PHY 1000BASE-T Status  <3800>
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] PHY Extended Status    <3000>
Apr 15 01:57:12 gateway vmunix: [ 7743.893557] PCI Status <10>
Apr 15 01:57:13 gateway vmunix: [ 7744.123237] net-fw DROP IN=eth0 OUT= 
MAC=00:13:3b:e3:8f:b0:0c:a4:02:35:6d:87:08:00 SRC=75.159.223.219 
DST=199.126.41.116 LE>
Apr 15 01:57:13 gateway vmunix: [ 7744.417235] net-fw DROP IN=eth0 OUT= 
MAC=00:13:3b:e3:8f:b0:0c:a4:02:35:6d:87:08:00 SRC=75.159.223.219 
DST=199.126.41.116 LE>
Apr 15 01:57:14 gateway vmunix: [ 7745.412183] net-fw DROP IN=eth0 OUT= 
MAC=00:13:3b:e3:8f:b0:0c:a4:02:35:6d:87:08:00 SRC=75.159.223.219 
DST=199.126.41.116 LE>
Apr 15 01:57:14 gateway vmunix: [ 7745.659234] net-fw DROP IN=eth0 OUT= 
MAC=00:13:3b:e3:8f:b0:0c:a4:02:35:6d:87:08:00 SRC=75.159.223.219 
DST=199.126.41.116 LE>
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] e1000e :00:1f.6 eth1: 
Detected Hardware Unit Hang:

Apr 15 01:57:14 gateway vmunix: [ 7745.877564] TDH  
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] TDT  
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] next_to_use  
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] next_to_clean    
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] buffer_info[next_to_clean]:
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] time_stamp   
<1001c6345>

Apr 15 01:57:14 gateway vmunix: [ 7745.877564] next_to_watch    
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] jiffies  
<1001c6740>

Apr 15 01:57:14 gateway vmunix: [ 7745.877564] next_to_watch.status <0>
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] MAC Status 
<80083>

Apr 15 01:57:14 gateway vmunix: [ 7745.877564] PHY Status <796d>
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] PHY 1000BASE-T Status  <3800>
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] PHY Extended Status    <3000>
Apr 15 01:57:14 gateway vmunix: [ 7745.877564] PCI Status <10>
Apr 15 01:57:15 gateway vmunix: [ 7746.220253] net-fw DROP IN=eth0 OUT= 
MAC=00:13:3b:e3:8f:b0:0c:a4:02:35:6d:87:08:00 SRC=75.159.223.219 
DST=199.126.41.116 LE>
Apr 15 01:57:15 gateway vmunix: [ 7746.485268] net-fw DROP IN=eth0 OUT= 
MAC=00:13:3b:e3:8f:b0:0c:a4:02:35:6d:87:08:00 SRC=75.159.223.219 
DST=199.126.41.116 LE>
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] e1000e :00:1f.6 eth1: 
Detected Hardware Unit Hang:

Apr 15 01:57:16 gateway vmunix: [ 7747.893578] TDH  
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] TDT  
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] next_to_use  
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] next_to_clean    
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] buffer_info[next_to_clean]:
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] time_stamp   
<1001c6345>

Apr 15 01:57:16 gateway vmunix: [ 7747.893578] next_to_watch    
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] jiffies  
<1001c6938>

Apr 15 01:57:16 gateway vmunix: [ 7747.893578] next_to_watch.status <0>
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] MAC Status 
<80083>

Apr 15 01:57:16 gateway vmunix: [ 7747.893578] PHY Status <796d>
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] PHY 1000BASE-T Status  <3800>
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] PHY Extended Status    <3000>
Apr 15 01:57:16 gateway vmunix: [ 7747.893578] PCI Status <10>


It does this multiple times, and the network interface (eth1 in this case)
becomes unstable and just stops responding. I can't have that, because this
computer is being used as a gateway.  Usually