Hello,
I'm migrating some systems from the 700 series network adapters to the
800 series, and I am noticing quite a few issues with SR-IOV. I'm using
between 8 and 128 VFs, depending on the machine.
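For anyone trying to reproduce this, VFs can be created through the
standard sriov_numvfs sysfs attribute of the PF. Below is a minimal
sketch, not necessarily how the machines above are configured: the
helper name set_num_vfs, the sysfs_root parameter and the interface
name in the usage note are assumptions made for illustration only.

```python
from pathlib import Path


def set_num_vfs(iface: str, num_vfs: int, sysfs_root: str = "/sys") -> None:
    """Request num_vfs virtual functions on the PF behind iface.

    iface is a hypothetical PF interface name; sysfs_root exists only so
    the helper can be pointed at a fake tree for testing.
    """
    dev = Path(sysfs_root) / "class" / "net" / iface / "device"

    # sriov_totalvfs reports the maximum number of VFs the PF supports.
    total = int((dev / "sriov_totalvfs").read_text())
    if not 0 <= num_vfs <= total:
        raise ValueError(f"{num_vfs} VFs requested, device supports 0..{total}")

    numvfs = dev / "sriov_numvfs"
    current = int(numvfs.read_text())
    if current == num_vfs:
        return

    # The PCI core refuses to go from one non-zero VF count to another,
    # so disable the existing VFs first.
    if current != 0:
        numvfs.write_text("0")
    numvfs.write_text(str(num_vfs))
```

Usage would look like set_num_vfs("enp1s0f0", 8), run as root, where
"enp1s0f0" stands in for whatever the PF is called on the machine.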
I am currently running Ubuntu 22.04.3 with linux-image-generic-hwe-22.04
6.5.0.14.14~22.04.7 and Debian testing with 6.5.0-4-amd64.
The NVM version is 4.30 on the Debian testing machine and 4.20 on the
Ubuntu system. I see the same issues with 4.40, so I haven't bothered to
upgrade (yet). I see this on pretty much all versions, and I see similar
issues with the out-of-tree driver. There are two different types of
errors; both are shown below.
The first occurs when the link happens to flap on a port (switch reboot,
etc.). I see the following error:
[ 831.658768] ================================================================================
[ 831.668264] UBSAN: array-index-out-of-bounds in /build/linux-hwe-6.5-q7NZ0T/linux-hwe-6.5-6.5.0/drivers/net/ethernet/intel/ice/ice_virtchnl.c:2020:45
[ 831.683305] workqueue: ice_service_task [ice] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[ 831.683417] index 2 is out of range for type 'virtchnl_ether_addr [1]'
[ 831.690770] CPU: 13 PID: 2041 Comm: kworker/13:0 Not tainted 6.5.0-14-generic #14~22.04.1-Ubuntu
[ 831.690774] Hardware name: Supermicro SYS-110D-8C-FRAN8TP/X12SDV-8C-SPT8F, BIOS 1.5 10/26/2023
[ 831.690777] Workqueue: ice ice_service_task [ice]
[ 831.690815] Call Trace:
[ 831.690817] <TASK>
[ 831.690819] dump_stack_lvl+0x48/0x70
[ 831.690826] dump_stack+0x10/0x20
[ 831.690829] __ubsan_handle_out_of_bounds+0xc6/0x110
[ 831.690835] ? __pfx_ice_vc_del_mac_addr+0x10/0x10 [ice]
[ 831.690891] ice_vc_handle_mac_addr_msg+0x1c6/0x1d0 [ice]
[ 831.690937] ice_vc_del_mac_addr_msg+0x10/0x20 [ice]
[ 831.690978] ice_vc_process_vf_msg+0x5a4/0x7f0 [ice]
[ 831.691020] __ice_clean_ctrlq+0x1cf/0x4f0 [ice]
[ 831.691056] ice_service_task+0x2cc/0x4a0 [ice]
[ 831.691092] process_one_work+0x23d/0x450
[ 831.691096] worker_thread+0x50/0x3f0
[ 831.691100] ? __pfx_worker_thread+0x10/0x10
[ 831.691103] kthread+0xef/0x120
[ 831.691107] ? __pfx_kthread+0x10/0x10
[ 831.691111] ret_from_fork+0x44/0x70
[ 831.691116] ? __pfx_kthread+0x10/0x10
[ 831.691119] ret_from_fork_asm+0x1b/0x30
[ 831.691123] </TASK>
[ 831.691125] ================================================================================
The second issue occurs on reboot:
[2771637.176795] ------------[ cut here ]------------
[2771637.187192] WARNING: CPU: 12 PID: 1 at kernel/irq/irqdomain.c:284 irq_domain_remove+0xd5/0x100
[2771637.201762] Modules linked in: irdma ice xt_nat nvidia_uvm(POE)
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc
fscache netfs xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user
xfrm_algo xt_addrtype nft_compat br_netfilter bridge stp llc tun nf_tables
libcrc32c nfnetlink nvme_fabrics overlay binfmt_misc sch_fq tcp_bbr
nvidia_drm(POE) intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd
nvidia_modeset(POE) kvm_amd ipmi_ssif zfs(POE) kvm snd_hda_codec_hdmi
zunicode(POE) irqbypass zzstd(OE) ghash_clmulni_intel snd_hda_intel
sha512_ssse3 snd_intel_dspcfg snd_intel_sdw_acpi sha512_generic zlua(OE)
nls_ascii snd_hda_codec nls_cp437 zavl(POE) vfat snd_hda_core icp(POE)
snd_hwdep fat acpi_ipmi aesni_intel snd_pcm i40e zcommon(POE) crypto_simd
video snd_timer ib_uverbs znvpair(POE) snd wmi cryptd drm_shmem_helper
drm_kms_helper sp5100_tco rapl acpi_cpufreq pcspkr ib_core soundcore
i2c_algo_bit ccp joydev evdev ipmi_msghandler
[2771637.201863] button k10temp watchdog spl(OE) sg nvidia(POE) loop fuse
efi_pstore drm dm_mod configfs efivarfs x_tables autofs4 ext4 crc16 mbcache
jbd2 crc32c_generic hid_generic rndis_host usbhid cdc_ether iavf usbnet ses
hid sd_mod mii enclosure nvme nvme_core ahci xhci_pci libahci t10_pi
mpt3sas xhci_hcd crc32_pclmul raid_class crc64_rocksoft libata crc32c_intel
scsi_transport_sas crc64 tg3 crc_t10dif crct10dif_generic usbcore scsi_mod
crct10dif_pclmul libphy gnss crct10dif_common usb_common scsi_common
i2c_piix4 [last unloaded: ipmi_si]
[2771637.445556] CPU: 12 PID: 1 Comm: systemd-shutdow Tainted: P W OE 6.5.0-4-amd64 #1 Debian 6.5.10-1
[2771637.465165] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.1 06/02/2021
[2771637.482150] RIP: 0010:irq_domain_remove+0xd5/0x100
[2771637.496525] Code: 21 89 e8 3e c9 52 00 eb b9 48 8b 7b 10 e8 83 a1 1c 00 48 89 df 5b e9 7a a1 1c 00 66 90 48 c7 05 6d a4 f8 02 00 00 00 00 eb 8a <0f> 0b e9 4b ff ff ff 31 d2 48 c7 c6 38 97 89 88 48 c7 c7 68 95 21
[2771637.535041] RSP: 0018:ffffb7618005bb70 EFLAGS: 00010282
[2771637.550303] RAX: 0000000000000000 RBX: ffff962e285afa40 RCX: ffff962e285ae0e8
[2771637.567618] RDX: ffff95c280391980 RSI: 0000000000000000 RDI: ffffffff890d2ae0
[2771637.584962] RBP: ffff9636607740d0 R08: ffffffff88834fa5 R09: 0000000000000068
[2771637.602347] R10: 0000000000000000 R11: ffffb7618005bbb8 R12: ffff95c46a9bd9c0
[2771637.619793] R13: ffff9636607740d0 R14: ffff96366077436c R15: 00000000fee1dead
[2771637.637280] FS: 00007f65ef77d500(0000) GS:ffff963fced00000(0000) knlGS:0000000000000000
[2771637.655883] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2771637.672165] CR2: 00007f65ef481edc CR3: 00000002ea336000 CR4: 0000000000350ee0
[2771637.689947] Call Trace:
[2771637.702993] <TASK>
[2771637.715641] ? irq_domain_remove+0xd5/0x100
[2771637.730511] ? __warn+0x81/0x130
[2771637.744327] ? irq_domain_remove+0xd5/0x100
[2771637.759104] ? report_bug+0x171/0x1a0
[2771637.773324] ? handle_bug+0x3c/0x80
[2771637.787284] ? exc_invalid_op+0x17/0x70
[2771637.801422] ? asm_exc_invalid_op+0x1a/0x20
[2771637.815743] ? irq_domain_remove+0xd5/0x100
[2771637.829921] msi_remove_device_irq_domain+0x66/0xc0
[2771637.844756] msi_device_data_release+0x18/0x60
[2771637.859108] release_nodes+0x40/0xb0
[2771637.872591] devres_release_all+0x8c/0xc0
[2771637.886559] device_unbind_cleanup+0xe/0x70
[2771637.900732] device_release_driver_internal+0x1cc/0x200
[2771637.915911] pci_stop_bus_device+0x6c/0x90
[2771637.929945] pci_stop_and_remove_bus_device+0x12/0x20
[2771637.944956] pci_iov_remove_virtfn+0xd5/0x140
[2771637.959127] sriov_disable+0x34/0xe0
[2771637.972288] ice_free_vfs+0x29a/0x2b0 [ice]
[2771637.985856] ? acpi_unregister_gsi_ioapic+0x2e/0x40
[2771637.999890] ? srso_return_thunk+0x5/0x10
[2771638.012804] ? acpi_pci_irq_disable+0x79/0xc0
[2771638.025849] ice_remove+0x216/0x220 [ice]
[2771638.038333] ice_shutdown+0x1a/0x50 [ice]
[2771638.050541] pci_device_shutdown+0x38/0x60
[2771638.062561] device_shutdown+0x118/0x1e0
[2771638.074166] kernel_restart+0x3a/0x90
[2771638.085274] __do_sys_reboot+0x142/0x230
[2771638.096410] do_syscall_64+0x60/0xc0
[2771638.106924] ? srso_return_thunk+0x5/0x10
[2771638.117632] ? exit_to_user_mode_prepare+0x40/0x1e0
[2771638.129017] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[2771638.140440] RIP: 0033:0x7f65ef31c553
[2771638.150201] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 91 48 0d 00 f7 d8
[2771638.181641] RSP: 002b:00007ffeff467d38 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[2771638.195540] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f65ef31c553
[2771638.208948] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
[2771638.222318] RBP: 0000000000000000 R08: 0000000000000069 R09: 0000000000000000
[2771638.235638] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[2771638.248841] R13: 0000000000000000 R14: 00007ffeff468088 R15: 0000000000000000
[2771638.261970] </TASK>
[2771638.270048] ---[ end trace 0000000000000000 ]---