Hi netdev,

After a recent Fedora CoreOS upgrade at the beginning of February, we have been experiencing kernel crashes on multiple nodes of the same hardware model, all of which point to an issue with the ixgbe driver.

The affected nodes reboot following a kernel NULL pointer dereference immediately after an ixgbe "Malicious event on VF" message.

Feb 27 21:34:34 d8-r11-c8-n3 kernel: ixgbe 0000:21:00.0: Malicious event on VF 3 tx:80000 rx:0
Feb 27 21:34:34 d8-r11-c8-n3 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000304

Different node in another chassis (identical hardware):

Feb 13 10:42:49 d8-r11-c9-n2 kernel: ixgbe 0000:21:00.0: Malicious event on VF 12 tx:80000 rx:0
Feb 13 10:42:49 d8-r11-c9-n2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000b2c

This has occurred on at least five separate nodes since our FCOS upgrade maintenance on February 3. After reboot, nodes return to normal operation until the next occurrence.

Currently for each of these systems, the journal always truncates after the BUG line. I will increase the panic delay and capture settings and update if we happen to catch a more meaningful trace before the reboot triggers.

Here's some relevant info pertaining to our environment:

- Linux kernel: 6.17.7-300.fc43.x86_64
- OS: Fedora CoreOS 43.20251110.3.1

Hardware Info:
- Gigabyte MZ62-HD0 nodes (H262-Z62 chassis)
  -- happening across multiple nodes in multiple chassis in the stack (i.e., not isolated to a single chassis)
- CPU is AMD EPYC 7302
- The NIC causing issues is: Intel X550 (rev 01) dual-port 10GBASE-T
  -- Bonded interfaces (802.3ad) to redundant leaf switches

Driver/mod Info:
- driver: ixgbe
- version: 6.17.7-300.fc43.x86_64
- firmware-version: 0x80000c67, 1.1276.0

From the modinfo:
- filename: /lib/modules/6.17.7-300.fc43.x86_64/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko.xz
- description: Intel(R) 10 Gigabit PCI Express Network Driver
- rhelversion: 10.99

The SR-IOV capability is present on the X550 adapter, however no VFs are configured:

/sys/class/net/enp33s0f0/device/sriov_numvfs = 0
/sys/module/ixgbe/parameters/ has only allow_unsupported_sfp = N

Also, no VF PCI devices appear in lspci output.

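For anyone who wants to check their own journals for the same pattern, here is a minimal Python sketch that pairs each "Malicious event on VF" line with a following NULL pointer dereference BUG line from the same host. The function name find_mdd_crashes and the regular expressions are illustrative only, written against the two excerpts above; they assume the default short journal format and are not taken from any existing tool.

```python
import re

# Both patterns assume the short journal format seen in the excerpts:
# "<Mon> <day> <HH:MM:SS> <host> kernel: <message>"
PREFIX = r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"

MDD_RE = re.compile(
    PREFIX + r"ixgbe\s+(?P<pci>\S+):\s+Malicious event on VF\s+(?P<vf>\d+)"
)
BUG_RE = re.compile(
    PREFIX
    + r"BUG: kernel NULL pointer dereference, address:\s+(?P<addr>[0-9a-fA-F]+)"
)


def find_mdd_crashes(lines):
    """Pair each MDD message with the next NULL-deref BUG line on the same host."""
    crashes = []
    pending = {}  # host -> most recent MDD match that has no BUG line yet
    for line in lines:
        mdd = MDD_RE.match(line)
        if mdd:
            pending[mdd["host"]] = mdd
            continue
        bug = BUG_RE.match(line)
        if bug and bug["host"] in pending:
            mdd = pending.pop(bug["host"])
            crashes.append(
                {
                    "host": bug["host"],
                    "time": mdd["ts"],
                    "pci": mdd["pci"],
                    "vf": int(mdd["vf"]),
                    "fault_address": bug["addr"],
                }
            )
    return crashes


if __name__ == "__main__":
    import sys

    # Reads journal text from stdin and prints one record per detected crash.
    for crash in find_mdd_crashes(sys.stdin):
        print(crash)
```

Feeding it the kernel messages of an earlier boot (for example, the output of journalctl -k -b -1) should give one record per crash, which makes it easier to compare VF numbers and fault addresses across nodes.
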
In checking the priv flags, I noticed there's one for mdd-disable-vf. I can try setting mdd-disable-vf to on after sending this to see if that helps as a potential mitigation, but the nondeterministic nature of this issue means it will take some time for us to know whether this change restores stability:

Private flags for enp33s0f0:
- legacy-rx     : off
- vf-ipsec      : off
- mdd-disable-vf: off

I'm wondering if this is a known issue in recent kernels affecting ixgbe/X550 devices when MDD events are triggered without SR-IOV VFs configured? I could not find anything recent in my searches, so I thought I would reach out to report the behavior and see if there's anything I might be missing.

Thanks for your time,
Melissa
