I've been troubleshooting a kernel panic we've been seeing in our production environment. First, the kernel panic:
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:3254!
invalid opcode: 0000 [#1] SMP
Modules linked in: zram vhost_vsock vmw_vsock_virtio_transport_common vsock nfnetlink_queue nfnetlink_log bluetooth iptable_nat xfs nf_conntrack_netlink nfnetlink ufs act_police cls_basic sch_ingress ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables x_tables vport_stt(OE) openvswitch(OE) nf_nat_ipv6 nf_nat_ipv4 nf_nat udp_tunnel dm_crypt ipmi_ssif bonding ipmi_devintf nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 dcdbas intel_rapl nf_defrag_ipv4 sb_edac edac_core nf_conntrack x86_pkg_temp_thermal intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel dm_multipath aesni_intel aes_x86_64 lrw glue_helper ablk_helper kvm_intel cryptd intel_cstate intel_rapl_perf kvm irqbypass mei_me ipmi_si vhost_net mei lpc_ich ipmi_msghandler shpchp vhost acpi_power_meter macvtap mac_hid macvlan coretemp lp parport btrfs raid456 async_raid6_recov async_memcpy asyn crc32c raid0 multipath linear raid1 raid10 ses enclosure scsi_transport_sas sfc(OE) mtd ptp ahci pps_core libahci mdio wmi megaraid_sas(OE) fjes [last unloaded: zram]
CPU: 10 PID: 39947 Comm: CPU 0/KVM Tainted: G OE K 4.9.77-1-generic #4
Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.3.6 06/03/2015
task: ffff9b01ed1eab80 task.stack: ffffa7c0a2b04000
RIP: 0010:[<ffffffffc0734e17>]  [<ffffffffc0734e17>] skb_segment+0xce7/0xed0
RSP: 0018:ffff9b237f943618  EFLAGS: 00010246
RAX: 00000000000089d5 RBX: ffff9b107c430f00 RCX: ffff9b107c431800
RDX: ffff9b22a5ab0d00 RSI: 00000000000060e2 RDI: 0000000000000440
RBP: ffff9b237f9436e8 R08: 00000000000060e2 R09: 000000000000626a
R10: 0000000000005ca2 R11: 0000000000000000 R12: ffff9b11279396c0
R13: ffff9b5360ff5500 R14: 00000000000060e2 R15: 0000000000000011
FS:  00007f557e58f700(0000) GS:ffff9b237f940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1d5f7d30f2 CR3: 00000006e4f9a000 CR4: 0000000000162670
Stack:
 ffff9b107c431800 ffffffffffffffde fffffff400000000 ffff9b107c431800
 ffff9b00c625bdf0 00005b2e01b39740 0000000000000001 0000000000000088
 0000000000b39740 0000000000000022 0000000000000009 ffff9b5300000000
Call Trace:
 <IRQ>
 [<ffffffff94bc6137>] udp4_ufo_fragment+0x127/0x1a0
 [<ffffffff94bcf32d>] inet_gso_segment+0x16d/0x3c0
 [<ffffffff94b5293a>] skb_mac_gso_segment+0xaa/0x110
 [<ffffffff94b52a66>] __skb_gso_segment+0xc6/0x190
 [<ffffffff946760d0>] ? ep_read_events_proc+0xc0/0xc0
 [<ffffffffc0665b3f>] queue_gso_packets+0x7f/0x1b0 [openvswitch]
 [<ffffffffc069d88d>] ? udp_error+0x16d/0x1c0 [nf_conntrack]
 [<ffffffffc0695282>] ? nf_ct_get_tuple+0x82/0xa0 [nf_conntrack]
 [<ffffffffc069d910>] ? udp_packet+0x30/0x90 [nf_conntrack]
 [<ffffffffc066dabc>] ? flow_lookup.isra.6+0x7c/0xb0 [openvswitch]
 [<ffffffffc0697d95>] ? nf_conntrack_in+0x2d5/0x560 [nf_conntrack]
 [<ffffffffc0665dc1>] ovs_dp_upcall+0x31/0x60 [openvswitch]
 [<ffffffffc0665ef3>] ovs_dp_process_packet+0x103/0x120 [openvswitch]
 [<ffffffffc065f2d4>] do_execute_actions+0x834/0x1510 [openvswitch]
 [<ffffffffc066dabc>] ? flow_lookup.isra.6+0x7c/0xb0 [openvswitch]
 [<ffffffffc065fff3>] ovs_execute_actions+0x43/0x110 [openvswitch]
 [<ffffffffc0665e76>] ovs_dp_process_packet+0x86/0x120 [openvswitch]
 [<ffffffffc0670040>] ? netdev_port_receive+0x100/0x100 [openvswitch]
 [<ffffffffc066f576>] ovs_vport_receive+0x76/0xd0 [openvswitch]
 [<ffffffff94b4fc3c>] ? netif_rx+0x1c/0x70
 [<ffffffffc06703ec>] ? ovs_ip_tunnel_rcv+0x8c/0xe0 [openvswitch]
 [<ffffffff94b8ae2b>] ? nf_iterate+0x5b/0x70
 [<ffffffffc0672888>] ? nf_ip_hook+0x738/0xde0 [openvswitch]
 [<ffffffff94b91df9>] ? ip_rcv_finish+0x129/0x420
 [<ffffffff94b8ae9b>] ? nf_hook_slow+0x5b/0xa0
 [<ffffffffc066fff0>] netdev_port_receive+0xb0/0x100 [openvswitch]
 [<ffffffffc0670040>] ? netdev_port_receive+0x100/0x100 [openvswitch]
 [<ffffffffc0670078>] netdev_frame_hook+0x38/0x60 [openvswitch]
 [<ffffffff94b501b0>] __netif_receive_skb_core+0x220/0xac0
 [<ffffffffc028c1e0>] ? efx_fast_push_rx_descriptors+0x50/0x310 [sfc]
 [<ffffffff94b50a68>] __netif_receive_skb+0x18/0x60
 [<ffffffff94b51b99>] process_backlog+0x89/0x140
 [<ffffffff94b511ac>] net_rx_action+0x10c/0x360
 [<ffffffff94c6eb0f>] __do_softirq+0xdf/0x2bb
 [<ffffffffc0285642>] ? efx_ef10_msi_interrupt+0x62/0x70 [sfc]
 [<ffffffff94c6dc3b>] do_IRQ+0x8b/0xd0
 [<ffffffff94487816>] irq_exit+0xb6/0xc0
 [<ffffffff94c6b956>] common_interrupt+0x96/0x96
 <EOI>
 [<ffffffff94c6b798>] ? irq_entries_start+0x578/0x6a0
 [<ffffffffc07b367b>] ? vmx_handle_external_intr+0x5b/0x60 [kvm_intel]
 [<ffffffffc052fe86>] vcpu_enter_guest+0x396/0x1290 [kvm]
 [<ffffffffc0536e07>] kvm_arch_vcpu_ioctl_run+0xb7/0x3d0 [kvm]
 [<ffffffffc051c6cf>] kvm_vcpu_ioctl+0x2af/0x570 [kvm]
 [<ffffffff94508362>] ? do_futex+0xb2/0x520
 [<ffffffff94641bb9>] do_vfs_ioctl+0x99/0x5f0
 [<ffffffffc052c6bf>] ? kvm_on_user_return+0x6f/0xa0 [kvm]
 [<ffffffff94642189>] SyS_ioctl+0x79/0x90
 [<ffffffff94c6aee4>] entry_SYSCALL_64_fastpath+0x24/0xcf
Code: 89 87 e0 00 00 00 49 8b 57 60 48 8b 43 60 48 89 53 60 49 89 47 60 49 8b 57 18 48 8b 43 18 48 89 53 18 49 89 47 18 e9 fa fb ff ff <0f> 0b 44 89 ee 48 89 df e8 6c 9a 40 d4 85 c0 0f 84 78 fe ff ff
RIP  [<ffffffffc0734e17>] skb_segment+0xce7/0xed0
 RSP <ffff9b237f943618>
---[ end trace f0d2cc8df9be8c23 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x13400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.

We are running a 4.9.77 kernel with one patch backported:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/core/skbuff.c?h=v4.17-rc6&id=13acc94eff122b260a387d68668bf9d670738e6a

This patch fixes a different kernel panic when using STT; however, the panic reported here is still reproducible without this patch applied. We are using stock Open vSwitch 2.7.3.

The panic is very reproducible, but it does require some configuration. In our case, we have two hosts acting as hypervisors, each with one guest VM. An STT tunnel is set up between the two hosts and attached to each guest. One guest acts as the source and one as the destination. The destination has connection tracking set up in its flows. We have a script running `ovs-dpctl del-flows` in a loop to make reproducing the crash easier, but it's not strictly necessary (it just makes it easier for an upcall to occur; see below). The source guest then sends a couple of very large (>60k) UDP packets, and the destination host crashes with the panic above.

The crash is the result of an skb that is not understood by skb_segment in net/core/skbuff.c. The skb comes from the Solarflare NIC as a large skb, requiring the use of frag_list.
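The skb layouts shown below are just the len, data_len, and nr_frags fields of the top-level skb and of each skb reachable through skb_shinfo(skb)->frag_list. A rough sketch of a helper that prints that kind of layout (illustrative only, not the exact instrumentation behind these dumps) would be:

#include <linux/skbuff.h>
#include <linux/printk.h>

/*
 * Illustrative only: walk an skb and its frag_list, printing the
 * fields shown in the dumps below.  Meant to be dropped in
 * temporarily (e.g. around the skb_segment() call path), not the
 * exact instrumentation used for this report.
 */
static void dump_skb_geometry(const struct sk_buff *skb, int depth)
{
	const struct sk_buff *frag;

	pr_info("%*sskb: %p len: %u, data_len: %u, nr_frags: %u\n",
		2 * depth, "", skb, skb->len, skb->data_len,
		skb_shinfo(skb)->nr_frags);

	/* Recurse so nested frag_lists (as seen after ip_frag_reasm
	 * below) are printed as well. */
	skb_walk_frags(skb, frag)
		dump_skb_geometry(frag, depth + 1);
}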
It looks something like this (note the use of frag_list):

skb: ffff92544a0bf000 len: 60177, data_len: 60169, nr_frags: 17
  frag_list: ffff92544a0bed00
    skb: ffff92544a0bed00 len: 24820, data_len: 24820, nr_frags: 17
    next: ffff92544a0be700
      skb: ffff92544a0be700 len: 10589, data_len: 10589, nr_frags: 8

It winds its way through the networking core and openvswitch (which strips off the outer STT encapsulation), eventually requiring an upcall. Since datapath/flow.c sets OVS_FRAG_TYPE_FIRST for any GSO UDP packet and connection tracking is set up, the skb ends up being passed back into the networking core to be reassembled:

https://github.com/openvswitch/ovs/blob/master/datapath/flow.c#L651

ip_frag_reasm in net/ipv4/ip_fragment.c then changes the skb into what appears to be a malformed form, because the skb uses a frag_list:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/net/ipv4/ip_fragment.c?h=v4.9.101#n583

After ip_frag_reasm, the skb now looks like this:

skb: ffff92544a0bf000 len: 60197, data_len: 60169, nr_frags: 17
  frag_list: ffff92544a0bfe00
    skb: ffff92544a0bfe00 len: 35409, data_len: 35409, nr_frags: 0
      frag_list: ffff92544a0bed00
        skb: ffff92544a0bed00 len: 24820, data_len: 24820, nr_frags: 17
        next: ffff92544a0be700
          skb: ffff92544a0be700 len: 10589, data_len: 10589, nr_frags: 8

There are now two nested frag_list uses, with the newly introduced skb ffff92544a0bfe00 having nr_frags == 0. Eventually openvswitch wants to segment the skb for the upcall, which ends up in skb_segment and finally crashes on the BUG_ON(!nfrags) (for skb ffff92544a0bfe00 in the example above).

It's not clear to me whether the problem is in openvswitch or in the networking core. Why does openvswitch set OVS_FRAG_TYPE_FIRST for any skb with SKB_GSO_UDP set, even if it's not a fragmented packet? And would it ever make sense for ip_frag_reasm to see an skb large enough to require the use of frag_list?

JE

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
