On 1/18/24 19:19, Paul Saab wrote:
Hello,

I'm migrating some systems from the 700 series network adapters to the 800 series, and I am noticing quite a few issues with SR-IOV.  I'm using between 8 and 128 VFs, depending on the machine.

I am currently using Ubuntu 22.04.3 with linux-image-generic-hwe-22.04 6.5.0.14.14~22.04.7 and Debian testing with 6.5.0-4-amd6

NVM is 4.30 on the debian testing machine and 4.20 on the ubuntu system.
I see the same issues with 4.40 so I haven't bothered to upgrade (yet).

I see this on pretty much all versions and similar issues with the out of tree driver.  There are two different types of errors and they are both below.

Hi,

Thank you for the report, and for the extra effort to check with
different versions/variants of the driver.


The first is when link happens to flap on a port (switch reboot etc) I see the following error:
[  831.658768] ================================================================================
[  831.668264] UBSAN: array-index-out-of-bounds in /build/linux-hwe-6.5-q7NZ0T/linux-hwe-6.5-6.5.0/drivers/net/ethernet/intel/ice/ice_virtchnl.c:2020:45
[  831.683305] workqueue: ice_service_task [ice] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[  831.683417] index 2 is out of range for type 'virtchnl_ether_addr [1]'

Here UBSAN is complaining about our improper `type array[1];` style of
declaring flex arrays at the end of structs (that we ended up with
thanks to implementing virtchnl as a "shared code, for MS Windows too").

Definitely worth a fix, but not something to be alarmed about.
Thankfully, Olek already fixed that in recent commit 5e7f59fa07f8 ("virtchnl: fix fake 1-elem arrays in structures allocated as `nents + 1`")
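
For anyone following along, here is a minimal user-space sketch of the
difference between the two declaration styles. The names (demo_addr,
demo_list_fake, demo_list_flex and the helpers) are made up for
illustration and the layout is simplified; this is not the actual
virtchnl code or the content of the fix, just the general idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified address entry (not the real virtchnl layout). */
struct demo_addr {
	uint8_t addr[6];
	uint8_t type;
	uint8_t pad;
};

/*
 * Old style: a fake one-element array used as a variable-length tail.
 * The declared type says "exactly one element", so a bounds checker may
 * flag list[2] even when the allocation really has room for it.
 */
struct demo_list_fake {
	uint16_t vsi_id;
	uint16_t num_elements;
	struct demo_addr list[1];
};

/* C99 flexible array member: the tail has no fixed length in the type. */
struct demo_list_flex {
	uint16_t vsi_id;
	uint16_t num_elements;
	struct demo_addr list[];
};

/* Bytes needed for a flex-style list holding n entries. */
static size_t flex_size(size_t n)
{
	return sizeof(struct demo_list_flex) + n * sizeof(struct demo_addr);
}

/* Allocate a zeroed flex-style list with room for n entries. */
static struct demo_list_flex *demo_list_alloc(uint16_t n)
{
	struct demo_list_flex *l = calloc(1, flex_size(n));

	if (l)
		l->num_elements = n;
	return l;
}

/* Set the type of entry idx, checking against the runtime element count. */
static void demo_list_set_type(struct demo_list_flex *l, uint16_t idx,
			       uint8_t type)
{
	assert(idx < l->num_elements);
	l->list[idx].type = type;
}
```

With the flexible array member the only bound is the runtime
num_elements, so the index check reflects the real allocation rather
than a placeholder length of 1 in the type.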

[snip]


The second issue is on reboot:
[2771637.176795] ------------[ cut here ]------------
[2771637.187192] WARNING: CPU: 12 PID: 1 at kernel/irq/irqdomain.c:284

This looks awfully similar to one of the other bugs with a recent fix,
7ae42ef308ed ("iavf: Fix iavf_shutdown to call iavf_remove instead iavf_close"), and I have heard from Red Hat that this commit helps a lot
in their case.

Neither of the two commits is present yet in the Ubuntu jammy (22.04)
hwe 6.5 kernel (at least in my copy).

Who from Canonical should we ask to put those on a fast path for
backporting?

Best Regards,
Przemek

(keeping the splat below)

irq_domain_remove+0xd5/0x100
[2771637.201762] Modules linked in: irdma ice xt_nat nvidia_uvm(POE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache netfs xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat br_netfilter bridge stp llc tun nf_tables libcrc32c nfnetlink nvme_fabrics overlay binfmt_misc sch_fq tcp_bbr nvidia_drm(POE) intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd nvidia_modeset(POE) kvm_amd ipmi_ssif zfs(POE) kvm snd_hda_codec_hdmi zunicode(POE) irqbypass zzstd(OE) ghash_clmulni_intel snd_hda_intel sha512_ssse3 snd_intel_dspcfg snd_intel_sdw_acpi sha512_generic zlua(OE) nls_ascii snd_hda_codec nls_cp437 zavl(POE) vfat snd_hda_core icp(POE) snd_hwdep fat acpi_ipmi aesni_intel snd_pcm i40e zcommon(POE) crypto_simd video snd_timer ib_uverbs znvpair(POE) snd wmi cryptd drm_shmem_helper drm_kms_helper sp5100_tco rapl acpi_cpufreq pcspkr ib_core soundcore i2c_algo_bit ccp joydev evdev ipmi_msghandler
[2771637.201863]  button k10temp watchdog spl(OE) sg nvidia(POE) loop fuse efi_pstore drm dm_mod configfs efivarfs x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic rndis_host usbhid cdc_ether iavf usbnet ses hid sd_mod mii enclosure nvme nvme_core ahci xhci_pci libahci t10_pi mpt3sas xhci_hcd crc32_pclmul raid_class crc64_rocksoft libata crc32c_intel scsi_transport_sas crc64 tg3 crc_t10dif crct10dif_generic usbcore scsi_mod crct10dif_pclmul libphy gnss crct10dif_common usb_common scsi_common i2c_piix4 [last unloaded: ipmi_si]
[2771637.445556] CPU: 12 PID: 1 Comm: systemd-shutdow Tainted: P  W  OE      6.5.0-4-amd64 #1  Debian 6.5.10-1
[2771637.465165] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.1 06/02/2021
[2771637.482150] RIP: 0010:irq_domain_remove+0xd5/0x100
[2771637.496525] Code: 21 89 e8 3e c9 52 00 eb b9 48 8b 7b 10 e8 83 a1 1c 00 48 89 df 5b e9 7a a1 1c 00 66 90 48 c7 05 6d a4 f8 02 00 00 00 00 eb 8a <0f> 0b e9 4b ff ff ff 31 d2 48 c7 c6 38 97 89 88 48 c7 c7 68 95 21
[2771637.535041] RSP: 0018:ffffb7618005bb70 EFLAGS: 00010282
[2771637.550303] RAX: 0000000000000000 RBX: ffff962e285afa40 RCX: ffff962e285ae0e8
[2771637.567618] RDX: ffff95c280391980 RSI: 0000000000000000 RDI: ffffffff890d2ae0
[2771637.584962] RBP: ffff9636607740d0 R08: ffffffff88834fa5 R09: 0000000000000068
[2771637.602347] R10: 0000000000000000 R11: ffffb7618005bbb8 R12: ffff95c46a9bd9c0
[2771637.619793] R13: ffff9636607740d0 R14: ffff96366077436c R15: 00000000fee1dead
[2771637.637280] FS:  00007f65ef77d500(0000) GS:ffff963fced00000(0000) knlGS:0000000000000000
[2771637.655883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2771637.672165] CR2: 00007f65ef481edc CR3: 00000002ea336000 CR4: 0000000000350ee0
[2771637.689947] Call Trace:
[2771637.702993]  <TASK>
[2771637.715641]  ? irq_domain_remove+0xd5/0x100
[2771637.730511]  ? __warn+0x81/0x130
[2771637.744327]  ? irq_domain_remove+0xd5/0x100
[2771637.759104]  ? report_bug+0x171/0x1a0
[2771637.773324]  ? handle_bug+0x3c/0x80
[2771637.787284]  ? exc_invalid_op+0x17/0x70
[2771637.801422]  ? asm_exc_invalid_op+0x1a/0x20
[2771637.815743]  ? irq_domain_remove+0xd5/0x100
[2771637.829921]  msi_remove_device_irq_domain+0x66/0xc0
[2771637.844756]  msi_device_data_release+0x18/0x60
[2771637.859108]  release_nodes+0x40/0xb0
[2771637.872591]  devres_release_all+0x8c/0xc0
[2771637.886559]  device_unbind_cleanup+0xe/0x70
[2771637.900732]  device_release_driver_internal+0x1cc/0x200
[2771637.915911]  pci_stop_bus_device+0x6c/0x90
[2771637.929945]  pci_stop_and_remove_bus_device+0x12/0x20
[2771637.944956]  pci_iov_remove_virtfn+0xd5/0x140
[2771637.959127]  sriov_disable+0x34/0xe0
[2771637.972288]  ice_free_vfs+0x29a/0x2b0 [ice]
[2771637.985856]  ? acpi_unregister_gsi_ioapic+0x2e/0x40
[2771637.999890]  ? srso_return_thunk+0x5/0x10
[2771638.012804]  ? acpi_pci_irq_disable+0x79/0xc0
[2771638.025849]  ice_remove+0x216/0x220 [ice]
[2771638.038333]  ice_shutdown+0x1a/0x50 [ice]
[2771638.050541]  pci_device_shutdown+0x38/0x60
[2771638.062561]  device_shutdown+0x118/0x1e0
[2771638.074166]  kernel_restart+0x3a/0x90
[2771638.085274]  __do_sys_reboot+0x142/0x230
[2771638.096410]  do_syscall_64+0x60/0xc0
[2771638.106924]  ? srso_return_thunk+0x5/0x10
[2771638.117632]  ? exit_to_user_mode_prepare+0x40/0x1e0
[2771638.129017]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[2771638.140440] RIP: 0033:0x7f65ef31c553
[2771638.150201] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 91 48 0d 00 f7 d8
[2771638.181641] RSP: 002b:00007ffeff467d38 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[2771638.195540] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f65ef31c553
[2771638.208948] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
[2771638.222318] RBP: 0000000000000000 R08: 0000000000000069 R09: 0000000000000000
[2771638.235638] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[2771638.248841] R13: 0000000000000000 R14: 00007ffeff468088 R15: 0000000000000000
[2771638.261970]  </TASK>
[2771638.270048] ---[ end trace 0000000000000000 ]---
