On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> Hi,
>
> in a setup where I use native XDP to redirect packets to a bonding
> interface that's backed by two ixgbe slaves, I noticed that the ixgbe
> driver constantly resets the NIC with the following kernel output:
>
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
>     Tx Queue             <4>
>     TDH, TDT             <17e>, <17e>
>     next_to_use          <181>
>     next_to_clean        <17e>
>   tx_buffer_info[next_to_clean]
>     time_stamp           <0>
>     jiffies              <10025c380>
>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
>
> This only occurs in combination with a bonding interface and XDP, so I
> don't know if this is an issue with ixgbe or the bonding driver.
> I first discovered this with Linux 6.8.0-57, but kernels 6.14.0 and
> 6.15.0-rc1 show the same issue.
>
>
> I managed to reproduce this bug in a lab environment. Here are some
> details about my setup and the steps to reproduce it:
>
> NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
>
> CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
> Also reproduced on:
> - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
> - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>
> Kernel: 6.15.0-rc1 (built from mainline)
>
> # ethtool -i ixgbe-x520-1
> driver: ixgbe
> version: 6.15.0-rc1
> firmware-version: 0x00012b2c, 1.3429.0
> expansion-rom-version:
> bus-info: 0000:01:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are
> directly connected with each other using a DAC cable. Both ports are
> configured as slaves of a bonding interface in balance-rr mode.
> Neither the direct connection of both ports nor the round-robin
> bonding mode is a requirement to reproduce the issue; this setup just
> makes it easier to reproduce in an isolated environment. The issue is
> also visible with a regular 802.3ad link aggregation with a switch on
> the other side.
>
> # modprobe bonding
> # ip link set dev ixgbe-x520-1 down
> # ip link set dev ixgbe-x520-2 down
> # ip link add bond0 type bond mode balance-rr
> # ip link set dev ixgbe-x520-1 master bond0
> # ip link set dev ixgbe-x520-2 master bond0
> # ip link set dev ixgbe-x520-1 up
> # ip link set dev ixgbe-x520-2 up
> # ip link set dev bond0 up
>
> # cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v6.15.0-rc1
>
> Bonding Mode: load balancing (round-robin)
> MII Status: up
> MII Polling Interval (ms): 0
> Up Delay (ms): 0
> Down Delay (ms): 0
> Peer Notification Delay (ms): 0
>
> Slave Interface: ixgbe-x520-1
> MII Status: up
> Speed: 10000 Mbps
> Duplex: full
> Link Failure Count: 0
> Permanent HW addr: 6c:b3:11:08:5c:3c
> Slave queue ID: 0
>
> Slave Interface: ixgbe-x520-2
> MII Status: up
> Speed: 10000 Mbps
> Duplex: full
> Link Failure Count: 0
> Permanent HW addr: 6c:b3:11:08:5c:3e
> Slave queue ID: 0
>
> # ethtool -l ixgbe-x520-1
> Channel parameters for ixgbe-x520-1:
> Pre-set maximums:
> RX:             n/a
> TX:             n/a
> Other:          1
> Combined:       63
> Current hardware settings:
> RX:             n/a
> TX:             n/a
> Other:          1
> Combined:       63
> (same for ixgbe-x520-2)
>
> In the following, the xdp-tools from
> https://github.com/xdp-project/xdp-tools/ are used.
>
> Enable XDP on the bonding interface and make sure all received packets
> are dropped:
> # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0
>
> Redirect a batch of packets to the bonding interface:
> # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0> \
>     --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0
>
> Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang"
> errors (see above) will show up in the kernel log.
>
> The high number of packets and the thread count (--threads 16) are not
> required to trigger the issue but greatly increase its probability.
>
>
> Do you have any ideas what may be causing this issue or what I can do
> to diagnose it further?
>
> Please let me know if I should provide any more information.
>
>
> Thanks!
> Marcus
>
Hi Marcus,

Thank you for reporting this issue! I have just successfully reproduced
the problem on our lab machine. What is interesting is that I do not
seem to have to use a bonding interface to get the "Tx timeout" that
causes the adapter to reset.

I will try to debug the problem more closely and let you know of any
updates.

Thanks,
Michal
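P.S. For anyone else trying to reproduce this: since each trafficgen run
may trigger several resets, it helps to count the hang events rather
than eyeball the console. A minimal sketch (the sample log file below is
only an illustration; in practice you would pipe real `dmesg` output
through the same grep):

```shell
#!/bin/sh
# Count "Detected Tx Unit Hang" events in a captured kernel log.
# On a live system you would use: dmesg | grep -c 'Detected Tx Unit Hang'
count_tx_hangs() {
    grep -c 'Detected Tx Unit Hang' "$1"
}

# Illustrative sample standing in for captured dmesg output
# (file name is arbitrary):
cat > /tmp/ixgbe-hang-sample.log <<'EOF'
ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
EOF

count_tx_hangs /tmp/ixgbe-hang-sample.log   # -> 2
```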