On Sat, Jun 6, 2026 at 1:26 AM Numan Siddique <[email protected]> wrote:
>
> On Fri, Jun 5, 2026 at 10:16 AM Eelco Chaudron <[email protected]> wrote:
>
>> On 2 Jun 2026, at 22:50, [email protected] wrote:
>>
>> > From: Numan Siddique <[email protected]>
>> >
>> > Hello,
>> >
>> > Below is a side-by-side trace of the same OVN-driven datapath pipeline,
>> > In our prod deployment we are seeing intermittent offload issues. All
>> > the datapath flows of a chain are getting offloaded except the last one.
>> > It is installed in the kernel dp because of which it kills the
>> > performance.
>> > If I run the command - ovs-appctl dpctl/del-flows, the problematic
>> > flow gets offloaded.
>> >
>> > The issue again can be reproduced if we run a script to delete
>> > the dp flows in a loop with a sleep of 5 seconds. Generally the issue
>> > gets surfaced after 5-6 del flows.
>> >
>> > Below are the datapath flows when the issue is seen
>> >
>> > The traffic is destined to the public IP(s) (there are 2 public ips in
>> > our setup) of the VM and enters the compute node PF and via the br-ex
>> > to the OVN pipeline.
>> >
>> > -------------------------------------------------------------------------------
>> > 1. BEFORE FLUSH
>> > -------------------------------------------------------------------------------
>> >
>> > Two upstream branches merge into the stranded chain `0x314610`. Each
>> > branch is exactly two TC-offloaded stages followed by the umbrella that
>> > is stuck in dp:ovs.
>> >
>> > +------------------------------+    +------------------------------+
>> > | recirc_id(0)   BRANCH A      |    | recirc_id(0)   BRANCH B      |
>> > | ufid:b81b9ab4                |    | ufid:29ac3fbf                |
>> > | in_port=enp210s0f0np0 (PF)   |    | in_port=enp210s0f0np0 (PF)   |
>> > | eth_type=0x8100, VLAN 120    |    | eth_type=0x8100, VLAN 26     |
>> > | eth(src=b0:cf:0e:b1:31:ff,   |    | eth(src=b0:cf:0e:b1:31:ff,   |
>> > |     dst=ae:ad:c9:2a:9d:0f)   |    |     dst=ae:ad:c9:2a:9d:0f)   |
>> > | ipv4(dst=AA.BB.CC.DD,        |    | ipv4(dst=XX.YY.ZZ.AA,        |
>> > |      src=32.0.0.0/224.0.0.0, |    |      src=8.0.0.0/248.0.0.0,  |
>> > |      ttl=119)                |    |      ttl=62)                 |
>> > | ct_state(0/0x2b)             |    | ct_state(0/0x2b)             |
>> > | ct_mark(0/0x2)               |    | ct_mark(0/0x2)               |
>> > |                              |    |                              |
>> > | actions:                     |    | actions:                     |
>> > |   pop_vlan,                  |    |   pop_vlan,                  |
>> > |   ct(zone=6,nat),            |    |   ct(zone=5,nat),            |
>> > |   recirc(0x320213)           |    |   recirc(0x321f17)           |
>> > |                              |    |                              |
>> > | pkts=565 bytes=44636         |    | pkts=26,867,289              |
>> > | used=1.640s                  |    | bytes=241,881,487,034        |
>> > | offloaded:yes, dp:tc         |    | used=0.620s                  |
>> > +------------------------------+    | offloaded:yes, dp:tc         |
>> >         |                           +------------------------------+
>> >         | post-DNAT in zone 6               |
>> >         v                                   | post-DNAT in zone 5
>> > +------------------------------+            v
>> > | recirc_id(0x320213)          |    +------------------------------+
>> > | ufid:3413a279                |    | recirc_id(0x321f17)          |
>> > | in_port=enp210s0f0np0 (PF)   |    | ufid:9f638cd3                |
>> > | ct_state(0x2a/0x3e)          |    | in_port=enp210s0f0np0 (PF)   |
>> > | ct_mark(0/0x43)              |    | ct_state(0x2a/0x3e)          |
>> > | eth(src=b0:cf:0e:b1:31:ff,   |    | ct_mark(0/0x43)              |
>> > |     dst=ae:ad:c9:2a:9d:0f)   |    | eth(src=b0:cf:0e:b1:31:ff,   |
>> > | ipv4(src=0.0.0.0/128.0.0.0,  |    |     dst=ae:ad:c9:2a:9d:0f)   |
>> > |      dst=172.27.61.7,        |    | ipv4(src=8.0.0.0/248.0.0.0,  |
>> > |      proto=6, ttl=119)       |    |      dst=172.27.61.7,        |
>> > |                              |    |      proto=6, ttl=62)        |
>> > | actions:                     |    |                              |
>> > |   ct_clear,                  |    | actions:                     |
>> > |   set(eth src=               |    |   ct_clear,                  |
>> > |       1a:83:58:7b:a8:ed),    |    |   set(eth src=               |
>> > |   set(ipv4 ttl=118),         |    |       1a:83:58:7b:a8:ed),    |
>> > |   ct(zone=11,nat),           |    |   set(ipv4 ttl=60),          |
>> > |   recirc(0x314610)           |    |   ct(zone=11,nat),           |
>> > |                              |    |   recirc(0x314610)           |
>> > | pkts=565 bytes=44636         |    |                              |
>> > | used=1.640s                  |    | pkts=26,867,289              |
>> > | offloaded:yes, dp:tc         |    | bytes=241,881,487,034        |
>> > +------------------------------+    | used=0.620s                  |
>> >         |                           | offloaded:yes, dp:tc         |
>> >         |                           +------------------------------+
>> >         |                                   |
>> >         +--------------+ +------------------+
>> >                        | |
>> >                        v v
>> >     +--------------------------------------+
>> >     | recirc_id(0x314610)   STAGE 2        |
>> >     | ufid:1ee350bf                        |
>> >     | in_port=enp210s0f0np0 (PF)           |
>> >     | ct_state(0x2a/0x3f)  <-- mask 0x3f   |
>> >     | ct_mark(0/0x41)                      |
>> >     | eth(src=*, dst=ae:ad:c9:2a:9d:0f)    |
>> >     | ipv4(src=*, dst=172.27.61.7,         |
>> >     |      proto=0/0, ttl=0/0)             |
>> >     |                                      |
>> >     | actions: enp210s0f0_1 (VF)           |
>> >     |                                      |
>> >     | pkts=41,192,879                      |
>> >     | bytes=2,502,536,363,732              |
>> >     | used=0.020s, flags=SFPR.             |
>> >     |                                      |
>> >     | dp:ovs  <-- STRANDED, NOT OFFLOADED  |
>> >     +--------------------------------------+
>> >
>> >
>> > -------------------------------------------------------------------------------
>> > 2. AFTER FLUSH (ovs-appctl dpctl/del-flows)
>> > -------------------------------------------------------------------------------
>> >
>> > After `ovs-appctl dpctl/del-flows` everything is re-installed in the
>> > natural pipeline order, so the chain check passes for every stage.
>> > The megaflow masks have not been re-aggregated yet, so we see a
>> > "fanned out" pipeline:
>> >
>> > +-----------------+   +-----------------+   +-----------------+
>> > | recirc_id(0)    |   | recirc_id(0)    |   | (parent for     |
>> > | BRANCH A        |   | BRANCH B        |   | chain           |
>> > | 5 sub-megaflows |   | 1 megaflow      |   | 0x3229d9 had    |
>> > | vlan 120        |   | vlan 26         |   | aged out at     |
>> > | zone 6 NAT      |   | zone 5 NAT      |   | dump time --    |
>> > |                 |   |                 |   | the two         |
>> > | dst=            |   | dst=            |   | stage-1         |
>> > |   AA.BB.CC.DD   |   |   XX.YY.ZZ.AA   |   | flows below     |
>> > | by src/ttl:     |   | src=8.0.0.0/5   |   | had pkts=0)     |
>> > |   104/5 ttl=56  |   | ttl=62          |   |                 |
>> > |   32/3 ttl=119  |   |                 |   | ufid:1b6d210e   |
>> > |   124/7 ttl=234 |   | pkts=14,326,765 |   | -- not          |
>> > |   32/3 ttl=122  |   | bytes=128.7 GB  |   | captured        |
>> > |   192/3 ttl=243 |   | used=0.660s     |   | for branch      |
>> > |                 |   |                 |   | C               |
>> > | actions:        |   | actions:        |   |                 |
>> > |   pop_vlan,     |   |   pop_vlan,     |   |                 |
>> > |   ct(zone=6,    |   |   ct(zone=5,    |   |                 |
>> > |     nat),       |   |     nat),       |   |                 |
>> > |   recirc(       |   |   recirc(       |   |                 |
>> > |     0x320213)   |   |     0x321f17)   |   |                 |
>> > | offloaded:yes   |   | offloaded:yes   |   |                 |
>> > | dp:tc           |   | dp:tc           |   |                 |
>> > +-----------------+   +-----------------+   +-----------------+
>> >          |                     |                     :
>> >          v                     v                     v
>> > +-----------------+   +-----------------+   +-----------------+
>> > | recirc_id       |   | recirc_id       |   | recirc_id       |
>> > | (0x320213)      |   | (0x321f17)      |   | (0x3229d9)      |
>> > |                 |   |                 |   |                 |
>> > | 3 sub-megaflows |   | 1 megaflow      |   | 2 megaflows     |
>> > | ct_state(       |   | ct_state(       |   | ct_state(       |
>> > |   0x2a/0x3e)    |   |   0x2a/0x3e)    |   |   0x21/0x3f)    |
>> > | (+est+rpl+trk)  |   | (+est+rpl+trk)  |   | (+new+trk)      |
>> > |                 |   |                 |   |                 |
>> > | ttl 119 -> 118  |   | ttl 62 -> 60    |   | ttl 234 -> 233  |
>> > | ttl 56 -> 55    |   |                 |   | ttl 243 -> 242  |
>> > | ttl 122 -> 121  |   | pkts=14,326,690 |   |                 |
>> > |                 |   | bytes=128.7 GB  |   | pkts=0 (new     |
>> > | pkts=68+9+1=78  |   | used=0.660s     |   |   conn attempts |
>> > |                 |   |                 |   |   in flight)    |
>> > | actions:        |   | actions:        |   |                 |
>> > |   ct_clear,     |   |   ct_clear,     |   | actions:        |
>> > |   set(eth src=  |   |   set(eth src=  |   |   (same shape   |
>> > |     1a:83:..),  |   |     1a:83:..),  |   |   as branch     |
>> > |   set(ipv4 ttl  |   |   set(ipv4 ttl  |   |   A/B stage 1)  |
>> > |     -1),        |   |     -1),        |   |   recirc(       |
>> > |   ct(zone=11,   |   |   ct(zone=11,   |   |     0x314610)   |
>> > |     nat),       |   |     nat),       |   |                 |
>> > |   recirc(       |   |   recirc(       |   | offloaded:yes   |
>> > |     0x314610)   |   |     0x314610)   |   | dp:tc           |
>> > | offloaded:yes   |   | offloaded:yes   |   |                 |
>> > | dp:tc           |   | dp:tc           |   |                 |
>> > +-----------------+   +-----------------+   +-----------------+
>> >          |                     |                     |
>> >          +---------------+     |     +---------------+
>> >                          |     |     |
>> >                          v     v     v
>> >           +-----------------------------------------+
>> >           | recirc_id(0x314610)   STAGE 2           |
>> >           |                                         |
>> >           | Three flows now (all offloaded:yes,     |
>> >           | dp:tc):                                 |
>> >           |                                         |
>> >           | 1. ufid:c51ef89d  <-- the umbrella      |
>> >           |    ct_state(0x2a/0x3e)  <-- mask 0x3e   |
>> >           |    ct_mark(0/0x41)                      |
>> >           |    eth(src=*, dst=ae:ad:c9:2a:9d:0f)    |
>> >           |    ipv4(dst=172.27.61.7)                |
>> >           |    actions: enp210s0f0_1 (VF)           |
>> >           |    pkts=14,326,720                      |
>> >           |    bytes=128,674,265,194                |
>> >           |    used=0.660s                          |
>> >           |                                         |
>> >           | 2. ufid:d6f6c8c3 (DROP, new conn ACL)   |
>> >           |    ct_state(0x21/0x3f) (+new+trk)       |
>> >           |    eth(src=1a:83:58:7b:a8:ed,           |
>> >           |        dst=ae:ad:00:00:00:00/           |
>> >           |            ff:ff:00:00:00:00)           |
>> >           |    dst=172.27.60.0/23,                  |
>> >           |    tcp ports w/ submask                 |
>> >           |    actions: drop                        |
>> >           |    pkts=0                               |
>> >           |                                         |
>> >           | 3. ufid:0b52d8bd (DROP, new conn ACL)   |
>> >           |    same shape, different tcp submask    |
>> >           |    actions: drop                        |
>> >           |    pkts=0                               |
>> >           +-----------------------------------------+
>> >
>> >
>> > (Note: The above ascii graph is generated by Claude)
>> >
>> >
>> > In the OVS logs we also see the below msg
>> > (https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2363)
>> >
>> > ```
>> > 2026-06-01T21:15:33.774Z|10763|netdev_offload_tc(handler18)|DBG|
>> > match for chain 3229200 failed due to non-existing goto chain action
>> > ```
>> >
>> > There seems to be a race condition during the ccmap 'used_chains'.
>> >
>> > As per Claude, the issue seems to be introduced in the commit:
>> > `273a4fce951a` - `netdev-offload-tc: Only install recirc flows if the
>> > parent is present.`
>> > and there is a possibility of a race window in the function
>> > netdev_tc_flow_put() between
>> > https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2695
>> > and
>> > https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2730
>> >
>> > @Eelco @Ilya - Do you have any idea on what could be going on here ?
>>
>> Hi Numan,
>>
>> Sorry for the late response, but this message ended up in my
>> spam box which I was cleaning up :( Put Ilya also on the TO
>> line, maybe it ended up in his spam also.
>>
>> I'm on PTO on Monday, so will try to take a look at this later
>> in the week.
>>
>> //Eelco
>
> Hi Eelco,
>
> Thanks for the reply.
>
> Below are some of the details I found during my investigation.
>
> OVN Logical topology
> ----
>
> VM1 -> VPC Logical switch (with ACLs) -> Logical Router (NATs configured) ->
> Public logical switch with localnet port -> br-int <-> patch ports ->
> br-ex -> Physical NIC
>
> In this case, there is an iperf session between the VM1 and outside iperf
> server.
>
> When the reply traffic from iperf server to the VM 1 enters the host via
> the physical NIC, before the packet is delivered to the VM, there are in
> total 3 recirculations (stage 0, 1 and 2) because of NAT in the logical
> router pipeline and ACLs in the logical switch pipeline -> recirc 0,
> recirc 4 and recirc 5 (for example).
>
> When the datapath flow for recirc 0 is offloaded, recirc 4 is saved in the
> "used_chains" (in tc_netdev_flow_put()) and when the dp flow with recirc 4
> is offloaded, recirc 5 is saved in the "used_chains".
>
> What I noticed is that for some of the packets from the server to the VM,
> tcp.flags has push/ack set and for some only ack set.
>
> And ovs-vswitchd generates a different ufid in each case, but after the
> flow translation the same datapath flow is generated.
>
> Suppose I delete the dp flows while iperf is running (ovs-appctl
> dpctl/del-flows): a lot of packets get upcalled. What I noticed is one set
> of reply packets from the server has tcp.flags == push/ack and another set
> has just tcp.flags == ack.
>
> When the handler thread does the flow translation for packets with
> [recirc=5, tcp.flags == push/ack] it generates ufid 'A' and offloads the
> flow if '5' is present in 'used_chains'.
>
> And when the handler thread does the flow translation for packets with
> [recirc=5, tcp.flags == ack] it generates ufid 'B' and when it tries to
> offload, tc returns EEXIST because both ufid 'A' and 'B' generate
> the same datapath flow (as the datapath flow doesn't have matches for tcp
> flags).
>
> And I see the error message 'match for chain 5 failed due to non-existing
> goto chain action' if the stage 1 flow (with recirc 4) was deleted.
>
> I was able to fix this issue by hacking ovn-northd and adding the below
> logical flow
>
> table=6 (ls_out_acl_eval ), priority=65533, match=(ct.est && !ct.rel && ct.rpl && ct_mark.blocked == 0 && tcp && (tcp.flags == 24 || tcp.flags == 16)), action=(reg8[21] = ct_label.nf; reg8[16] = 1; next;)
> table=6 (ls_out_acl_eval ), priority=65532, match=(ct.est && !ct.rel && ct.rpl && ct_mark.blocked == 0), action=(reg8[21] = ct_label.nf; reg8[16] = 1; next;)
>
> Priority 65532 is the existing logical flow which northd adds to allow all
> the established reply packets.
>
> Since priority 65533 flow matches on tcp.flags, now two distinct datapath
> flows are added and I do not see this race issue even after deleting the
> dpctl flows in a loop. (I added a sleep of 8 seconds between the deletes).
>
> Without the hack in northd, I'm able to reproduce the issue 100% of the
> time within the first 3 dp flow deletes in a fake-multinode environment.
>
> When the issue is seen, the first 2 dp flows (recirc 0 and 4) are
> offloaded to tc and the last one is not.
>
> When 2 different ufids result in the same dp flow, perhaps the
> tc_netdev_flow_put() should store both ufids in the ufid_to_tc_mappings ?
>
> Let me know if you want me to provide a script to reproduce using
> ovn-fake-multinode.
>

Please ignore my hack theory. I deployed a simple fake multi node
environment and started the iperf between the fake vm "sw01p1" in
ovn-chassis-1 and the "ovnfake-ext1" namespace on the host.

I was able to reproduce the issue even with the hack in place, by deleting
the dp flows.

You can run the command -

    watch -n1 'ovs-appctl dpctl/dump-flows -m | grep "in_port(eth2"'

in ovn-chassis-1 and notice that the dp flow changes from "dp:tc" to
"dp:ovs".

Below is my theory on why offload fails in our prod deployment

T1. Packet P_A arrives (upcalled), hashed to ufid:A.
    xlate produces megaflow M1 with action set ending in recirc(R1).
    R1 = recirc_alloc_id_ctx(frozen-alpha).
    tc_netdev_flow_put(M1) fails (any reason -- unsupported action,
    EOPNOTSUPP from a probe, mlx5 driver capability mismatch, etc.).
    M1 lands in dp:ovs via dpif_netlink fallback.
    ccmap_inc(used_chains, R1) NEVER fires.

T2. P_A continues through openvswitch.ko, internal recirc to R1, miss,
    upcall. Hashed to ufid:B.
    xlate produces megaflow M2 matching on recirc_id=R1, with some output
    action (e.g. mirred to a port).
    tc_netdev_flow_put(M2):
      chain check: ccmap_find(used_chains, R1) == 0 -> BAIL
      return EOPNOTSUPP.
    M2 lands in dp:ovs.

T3. A different packet P_C arrives, hashed to ufid:C.
    xlate produces megaflow M3 with a DIFFERENT match key but arrives at
    the same OF freeze point as M1's translation did.
    Same frozen_state -> same R1.
    M3 install SUCCEEDS in TC.
    ccmap_inc(used_chains, R1) -> count = 1.

T4. P_C continues, TC recircs to R1.
    TC chain R1 is empty (M2 is in dp:ovs, not TC).
    Packet falls to openvswitch.ko at recirc_id=R1.
    M2 matches; executes in SW.

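The T1-T4 sequence can be replayed with a small toy model. This is only a
sketch of the theory, not OVS code: TcOffloadModel, flow_put(), hw_ok and
the value of R1 are made-up names for illustration, and the only rule
modelled is the assumed parent-chain check (a flow matching on a non-zero
recirc_id is offloaded only if 'used_chains' already references that chain).

```python
from collections import Counter


class TcOffloadModel:
    """Toy model of the parent-chain check from T1-T4. Not OVS code."""

    def __init__(self):
        # Stand-in for the 'used_chains' ccmap: chain id -> number of
        # offloaded flows whose actions jump to that chain.
        self.used_chains = Counter()
        # Megaflow name -> "dp:tc" or "dp:ovs".
        self.placement = {}

    def flow_put(self, name, match_chain, goto_chain=None, hw_ok=True):
        """Try to offload one megaflow and record where it ends up.

        hw_ok=False models a TC install failure for an unrelated reason.
        """
        # Assumed rule: a flow matching on a non-zero recirc_id is offloaded
        # only while some offloaded flow already jumps to that chain.
        parent_present = match_chain == 0 or self.used_chains[match_chain] > 0
        if hw_ok and parent_present:
            if goto_chain is not None:
                self.used_chains[goto_chain] += 1
            self.placement[name] = "dp:tc"
        else:
            # Fallback to the kernel datapath; used_chains is not touched.
            self.placement[name] = "dp:ovs"
        return self.placement[name]


R1 = 0x1  # arbitrary recirc id, for illustration only

model = TcOffloadModel()
model.flow_put("M1", match_chain=0, goto_chain=R1, hw_ok=False)  # T1
model.flow_put("M2", match_chain=R1)                             # T2
model.flow_put("M3", match_chain=0, goto_chain=R1)               # T3

# T4: chain R1 is now referenced, but nothing revisits M2.
print(model.placement)
print("used_chains[R1] =", model.used_chains[R1])
```

In this model M2 stays in dp:ovs even though used_chains[R1] becomes 1 at
T3, because nothing re-attempts the offload of M2 once its parent shows up.
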
Net result: M3 (upstream sibling) is in dp:tc, M2 (downstream) is
permanently in dp:ovs. P_C traffic does TC-fast-path on the upstream stage
but kernel-SW on the downstream stage, every packet.

Does this theory make sense ?

Thanks
Numan

> Thanks
> Numan
>
>> > Let me know if you need more information. I'll try to debug further.
>> >
>> > Thanks
>> > Numan
>>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
