Hi Ilya, Sorry for the late follow-up. Your patch has now run without crashes in my setup for 36 hours and I'm unable to re-trigger a panic.
> But the operation in the diff above is kind of pointless as the memcpy > itself will copy the value again. Right. So it would probably need at least a comment, if that ends up becoming the final fix. I'm not a kernel dev and I don't know what is the right thing to do here. Happy to test alternative patches if needed. BR Johan Den man. 8. jun. 2026 kl. 11.41 skrev Ilya Maximets <[email protected]>: > > On 6/8/26 10:25 AM, Johan Thomsen wrote: > > Hello, > > > > I am seeing what looks like a kernel bug in the Geneve/OVS/vhost > > transmit path on a Talos Linux node running Kube-ovn with Geneve > > overlay and KubeVirt VM traffic. > > > > Environment: > > > > Kernel: 6.18.33-talos > > Distro: Talos v1.13.3 > > > > Compiler/config: > > > > CONFIG_CC_VERSION_TEXT="clang version 22.1.2" > > CONFIG_CC_IS_CLANG=y > > CONFIG_LTO=y > > CONFIG_LTO_CLANG=y > > CONFIG_LTO_CLANG_THIN=y > > CONFIG_FORTIFY_SOURCE=y > > > > Hardware: HPE ProLiant DL325 Gen11, AMD EPYC > > > > NIC driver: bnxt_en > > > > Workload/network: > > > > Kube-OVN, Geneve overlay > > Open vSwitch datapath > > KubeVirt/QEMU VM traffic via vhost/tap > > > > Relevant console output: > > > > [ 648.742603] memcpy: detected buffer overflow: 104 byte write of > > buffer size 96 > > [ 648.749907] WARNING: CPU: 61 PID: 27020 at > > lib/string_helpers.c:1036 __fortify_report+0x45/0x60 > > [ 648.758689] Modules linked in: dm_round_robin dm_multipath lpfc > > nvmet_fc nvmet intel_rapl_msr intel_rapl_common ahci nvme_auth bnxt_en > > nvme hpilo hkdf libahci sp5100_tco watchdog k10temp > > [ 648.775429] CPU: 61 UID: 107 PID: 27020 Comm: vhost-27002 Not > > tainted 6.18.29-talos #1 PREEMPT(none) > > [ 648.784735] Hardware name: HPE ProLiant DL325 Gen11/ProLiant DL325 > > Gen11, BIOS 2.84 11/05/2025 > > [ 648.890478] skb_tunnel_info_unclone+0x179/0x190 > > [ 648.895152] geneve_xmit+0x7fe/0xe00 > > [ 648.907240] dev_hard_start_xmit+0xa7/0x1f0 > > [ 648.911479] __dev_queue_xmit+0x864/0xf40 > > [ 648.919688] do_execute_actions+0x9b9/0x1be0 > > [ 648.927727] ovs_execute_actions+0x58/0x170 > > [ 648.931960] ovs_dp_process_packet+0xb1/0x1c0 > > [ 648.936370] ovs_vport_receive+0x90/0x100 > > [ 648.940428] netdev_frame_hook+0x146/0x1a0 > > [ 648.954093] __netif_receive_skb+0x3f/0x160 > > [ 648.958324] process_backlog+0x10c/0x210 > > [ 648.962295] __napi_poll+0x2f/0x190 > > [ 648.965832] net_rx_action+0x2e3/0x500 > > [ 648.969632] handle_softirqs+0xe7/0x310 > > [ 648.985387] tun_get_user+0x137e/0x1510 > > [ 649.005878] handle_tx+0x41f/0xd30 > > [ 649.029014] vhost_run_work_list+0x52/0x90 > > [ 649.033162] vhost_task_fn+0xc2/0x140 > > [ 649.064145] ---[ end trace 0000000000000000 ]--- > > [ 649.068820] ------------[ cut here ]------------ > > [ 649.073489] kernel BUG at lib/string_helpers.c:1043! > > > > I don't know whether this is a real overflow or a FORTIFY false-positive. > > Looks like a false-positive from the __counted_by fortification. > > I'd guess something like this would fit it: > > diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h > index 1fc2fb03ce3f9..e51c3795da474 100644 > --- a/include/net/dst_metadata.h > +++ b/include/net/dst_metadata.h > @@ -164,6 +164,7 @@ static inline struct metadata_dst *tun_dst_unclone(struct > sk_buff *skb) > if (!new_md) > return ERR_PTR(-ENOMEM); > > + new_md->u.tun_info.options_len = md_size; > memcpy(&new_md->u.tun_info, &md_dst->u.tun_info, > sizeof(struct ip_tunnel_info) + md_size); > #ifdef CONFIG_DST_CACHE > --- > > Johan, could you try this in your setup? > > The memory was actually allocated for the options, but the structure > is zeroed out on allocation, so the __counted_by check doesn't work > properly for the initial initialization copy. > > But the operation in the diff above is kind of pointless as the memcpy > itself will copy the value again. So, I'm not sure if that's the right > solution here. > > Alternative might be to revert the kmalloc_flex back to the simple > kmalloc in metadata_dst_alloc. > > CC: Kees > > Best regards, Ilya Maximets. > > > > > I cannot reproduce the issue on Talos v1.12.X which uses a gcc built > > kernel, whereas the affected kernel is built with clang. Don't know > > whether this is relevant here. > > > > I am currently trying to make a reliable reproducer, but I can almost > > always trigger the issue when iperf stressing the VM-network. > > > > Please let me know if this should go to a more specific > > maintainer/list or further info is needed. I am able to test candidate > > patches if provided. > > > > Downstream bug reports: > > https://github.com/siderolabs/talos/issues/13440 > > https://github.com/kubeovn/kube-ovn/issues/6767 > > > > Thanks, > > Johan > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
