Hello,
We got kernel crash in STT module in our could environment. I have made
root cause analysis and I happy to provide RCA report here:
Version reported: OVS v2.17.6
Kernel crash:
[ 120.574175] ------------[ cut here ]------------
[ 120.574178] kernel BUG at net/core/skbuff.c:122!
[ 120.574510] invalid opcode: 0000 [#1] SMP NOPTI
[ 120.574840] CPU: 4 PID: 40 Comm: ksoftirqd/4 Kdump: loaded Tainted:
G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
[ 120.575538] Hardware name: Dell Inc. PowerEdge R720/0X6FFV, BIOS
2.5.2 01/28/2015
[ 120.575916] RIP: 0010:skb_panic+0x4b/0x4d
[ 120.576288] Code: 00 00 50 8b 87 dc 00 00 00 50 8b 87 d8 00 00 00 50
ff b7 e8 00 00 00 4c 8b 8f e0 00 00 00 48 c7 c7 f8 0a 1a 9d e8 58 10 95
ff <0f> 0
b 48 8b 14 24 48 c7 c1 c0 ea ee 9c e8 a3 ff ff ff 48 c7 c6 00
[ 120.577132] RSP: 0018:ffffba2581b7f678 EFLAGS: 00010246
[ 120.577553] RAX: 000000000000008b RBX: ffff9087064a4b00 RCX:
0000000000000000
[ 120.577989] RDX: 0000000000000000 RSI: ffff9086f789e698 RDI:
ffff9086f789e698
[ 120.578428] RBP: 0000000000009ec2 R08: 0000000000000000 R09:
c0000000ffff7fff
[ 120.578877] R10: 0000000000000001 R11: ffffba2581b7f498 R12:
0000000000002f1d
[ 120.579576] R13: 0000000000000c20 R14: 0000000000000008 R15:
ffff9085c999a400
[ 120.580041] FS: 0000000000000000(0000) GS:ffff9086f7880000(0000)
knlGS:0000000000000000
[ 120.580512] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 120.580971] CR2: 00007f0eb711e340 CR3: 00000003b4410006 CR4:
00000000000606e0
[ 120.581445] Call Trace:
[ 120.581906] skb_push.cold.98+0x10/0x10
[ 120.582374] ovs_stt_xmit.cold.48+0x166/0x480 [openvswitch]
[ 120.582870] do_execute_actions+0xc51/0xc90 [openvswitch]
[ 120.583361] ? flow_lookup.isra.12+0x40/0xb0 [openvswitch]
Open vSwitch background to reproduce:
1. OVS action field in flow must have two 'output' commands executed on
the same packet and routed to different ports. This double output
results in input socket buffer will be cloned and used in subsequent
xmit calls. Here I provide part of code that will be executed in
do_execute_actions()
case OVS_ACTION_ATTR_OUTPUT: {
int port = nla_get_u32(a);
struct sk_buff *clone;
/* Every output action needs a separate clone
* of 'skb', In case the output action is the
* last action, cloning can be avoided.
*/
if (nla_is_last(a, rem)) {
do_output(dp, skb, port, key);
/* 'skb' has been used for output.
*/
return 0;
}
clone = skb_clone(skb, GFP_ATOMIC);
if (clone)
do_output(dp, clone, port, key);
2. Input packet must be split into at least two fragments.
3. STT tunneling must be in use.
Open vSwitch kernel module background to reproduce.
1. We must have input packet consisting two fragments:
Head part A (struct of sk_buff) and 2-nd part F (struct of sk_buff)
A->frag_list == F
2. We call B = skb_clone(A) so, we create clone of input sk_buff. Note,
that after clone second part of packet is not copied, thus A->flag_list
== B->frag_list
3. Call send action on both buffers, in fact, ovs_stt_xmit(A),
ovs_stt_xmit(B)
4. We get kernel crash.
Why it happens:
STT code wants to prepend STT header in face of incoming packet.
For input fragmented packet it wants to add STT header to every
fragment. Thus, every fragment supplied with STT header cannot be
enitire network packet anymore but rather set of individual packets.
Therefore STT rebuilds input packet detaching fragments and turn them
into individual packets. Technically say, it moves A->frag_list pointer
to A->next pointer. Then STT header is prepended to every element in skb
chain, (STT+)A, (STT+)A->next -- which is F.
For now a situation is not very cute but somehow correct.
Then we invoke same steps for B. B->frag_list still have pointer to
F which already has STT header prepended.
Running same way we fall into situation where B, B->next (==F) come
to the code that adds STT headers. As result we could potentially have
(STT+)B, (STT+STT+)B->next - which is F, so, second part of input packet
will be corrupted, but, hopefully (?), a second attempt to save STT
header alarmed memory availability guard code.
What is proposed to do:
Note that Geneve code does not process fragmented-case
individually. Logically say, we always have to prepend exactly ONE
tunnel header in front of entire packet, it does not matter how many
fragments it had on input. When such packet comes to output it shall be
reassembled and possibly fragmented again by ip code. Since I find no
reasons (except maybe historical reasons) to convert fragmented packet
to several individual STT packets I have made an attempt to just remove
STT packet defragmentation call, so, such like Geneve, STT adds only one
header in front of entire packet. After such change kernel crash gone
and all packets were transmitted correctly in out testing environment.
I append this mail with possible patch just in case you want to have a fix.
Special attention required.
STT code has installed NF hook that really need to reassemble
fragments to whole packet. I think a same or similar problem could
happen here under same conditions. I did not make efforts to investigate
this case. An only question I have is Do we really need to cooperate
with NF here?
=============== PATCH Here ==================
From d8543538dd08bc53c729c26d119d430dd16379f6 Mon Sep 17 00:00:00 2001
From: Aleksandr Smirnov <[email protected]>
Date: Fri, 1 Nov 2024 14:00:36 +0300
Subject: [PATCH] stt: fix kernel panic skb_push
---
datapath/linux/compat/stt.c | 16 ----------------
1 file changed, 16 deletions(-)
diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c
index 39a294764..7b1c7f3e7 100644
--- a/datapath/linux/compat/stt.c
+++ b/datapath/linux/compat/stt.c
@@ -663,22 +663,6 @@ static struct sk_buff *push_stt_header(struct
sk_buff *head, __be64 tun_id,
{
struct sk_buff *skb;
- if (skb_shinfo(head)->frag_list) {
- bool ipv4 = (l3_proto == htons(ETH_P_IP));
- bool tcp = (l4_proto == IPPROTO_TCP);
- bool csum_partial = (head->ip_summed == CHECKSUM_PARTIAL);
- int l4_offset = skb_transport_offset(head);
-
- /* Need to call skb_orphan() to report currect true-size.
- * calling skb_orphan() in this layer is odd but SKB with
- * frag-list should not be associated with any socket, so
- * skb-orphan should be no-op. */
- skb_orphan(head);
- if (unlikely(segment_skb(&head, csum_partial,
- ipv4, tcp, l4_offset)))
- goto error;
- }
-
for (skb = head; skb; skb = skb->next) {
if (__push_stt_header(skb, tun_id, s_port, d_port, saddr, dst,
l3_proto, l4_proto, dst_mtu))
--
2.47.0
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev