On Sat, Jun 6, 2026 at 1:26 AM Numan Siddique <[email protected]> wrote:

>
>
> On Fri, Jun 5, 2026 at 10:16 AM Eelco Chaudron <[email protected]>
> wrote:
>
>>
>>
>> On 2 Jun 2026, at 22:50, [email protected] wrote:
>>
>> > From: Numan Siddique <[email protected]>
>> >
>> > Hello,
>> >
>> > Below is a side-by-side trace of the same OVN-driven datapath pipeline.
>> > In our prod deployment we are seeing intermittent offload issues. All
>> > the datapath flows of a chain are getting offloaded except the last one.
>> > It is installed in the kernel datapath (dp:ovs) instead, which kills
>> > performance.
>> > If I run the command - ovs-appctl dpctl/del-flows,  the problematic
>> > flow gets offloaded.
>> >
>> > The issue can be reproduced again if we run a script that deletes
>> > the dp flows in a loop with a sleep of 5 seconds.  Generally the issue
>> > surfaces after 5-6 flow deletes.
>> >
>> > Below are the datapath flows when the issue is seen
>> >
>> > The traffic is destined to the public IP(s) of the VM (there are 2
>> > public IPs in our setup).  It enters the compute node through the PF
>> > and reaches the OVN pipeline via br-ex.
>> >
>> > -------------------------------------------------------------------------------
>> > 1. BEFORE FLUSH
>> > -------------------------------------------------------------------------------
>> >
>> > Two upstream branches merge into the stranded chain `0x314610`.  Each
>> > branch is exactly two TC-offloaded stages followed by the umbrella that
>> > is stuck in dp:ovs.
>> >
>> >    +------------------------------+      +------------------------------+
>> >    | recirc_id(0)        BRANCH A |      | recirc_id(0)        BRANCH B |
>> >    | ufid:b81b9ab4                |      | ufid:29ac3fbf                |
>> >    | in_port=enp210s0f0np0  (PF)  |      | in_port=enp210s0f0np0  (PF)  |
>> >    | eth_type=0x8100, VLAN 120    |      | eth_type=0x8100, VLAN 26     |
>> >    | eth(src=b0:cf:0e:b1:31:ff,   |      | eth(src=b0:cf:0e:b1:31:ff,   |
>> >    |     dst=ae:ad:c9:2a:9d:0f)   |      |     dst=ae:ad:c9:2a:9d:0f)   |
>> >    | ipv4(dst=AA.BB.CC.DD,        |      | ipv4(dst=XX.YY.ZZ.AA,        |
>> >    |      src=32.0.0.0/224.0.0.0, |      |      src=8.0.0.0/248.0.0.0,  |
>> >    |      ttl=119)                |      |      ttl=62)                 |
>> >    | ct_state(0/0x2b)             |      | ct_state(0/0x2b)             |
>> >    | ct_mark(0/0x2)               |      | ct_mark(0/0x2)               |
>> >    |                              |      |                              |
>> >    | actions:                     |      | actions:                     |
>> >    |   pop_vlan,                  |      |   pop_vlan,                  |
>> >    |   ct(zone=6,nat),            |      |   ct(zone=5,nat),            |
>> >    |   recirc(0x320213)           |      |   recirc(0x321f17)           |
>> >    |                              |      |                              |
>> >    | pkts=565   bytes=44636       |      | pkts=26,867,289              |
>> >    | used=1.640s                  |      | bytes=241,881,487,034        |
>> >    | offloaded:yes, dp:tc         |      | used=0.620s                  |
>> >    +------------------------------+      | offloaded:yes, dp:tc         |
>> >                   |                      +------------------------------+
>> >                   | post-DNAT in zone 6                 |
>> >                   v                                     | post-DNAT in zone 5
>> >    +------------------------------+                     v
>> >    | recirc_id(0x320213)          |      +------------------------------+
>> >    | ufid:3413a279                |      | recirc_id(0x321f17)          |
>> >    | in_port=enp210s0f0np0  (PF)  |      | ufid:9f638cd3                |
>> >    | ct_state(0x2a/0x3e)          |      | in_port=enp210s0f0np0  (PF)  |
>> >    | ct_mark(0/0x43)              |      | ct_state(0x2a/0x3e)          |
>> >    | eth(src=b0:cf:0e:b1:31:ff,   |      | ct_mark(0/0x43)              |
>> >    |     dst=ae:ad:c9:2a:9d:0f)   |      | eth(src=b0:cf:0e:b1:31:ff,   |
>> >    | ipv4(src=0.0.0.0/128.0.0.0,  |      |     dst=ae:ad:c9:2a:9d:0f)   |
>> >    |      dst=172.27.61.7,        |      | ipv4(src=8.0.0.0/248.0.0.0,  |
>> >    |      proto=6, ttl=119)       |      |      dst=172.27.61.7,        |
>> >    |                              |      |      proto=6, ttl=62)        |
>> >    | actions:                     |      |                              |
>> >    |   ct_clear,                  |      | actions:                     |
>> >    |   set(eth src=               |      |   ct_clear,                  |
>> >    |        1a:83:58:7b:a8:ed),   |      |   set(eth src=               |
>> >    |   set(ipv4 ttl=118),         |      |        1a:83:58:7b:a8:ed),   |
>> >    |   ct(zone=11,nat),           |      |   set(ipv4 ttl=60),          |
>> >    |   recirc(0x314610)           |      |   ct(zone=11,nat),           |
>> >    |                              |      |   recirc(0x314610)           |
>> >    | pkts=565   bytes=44636       |      |                              |
>> >    | used=1.640s                  |      | pkts=26,867,289              |
>> >    | offloaded:yes, dp:tc         |      | bytes=241,881,487,034        |
>> >    +------------------------------+      | used=0.620s                  |
>> >                   |                      | offloaded:yes, dp:tc         |
>> >                   |                      +------------------------------+
>> >                   |                                     |
>> >                   +--------------+        +-------------+
>> >                                  |        |
>> >                                  v        v
>> >                   +--------------------------------------+
>> >                   | recirc_id(0x314610)        STAGE 2   |
>> >                   | ufid:1ee350bf                        |
>> >                   | in_port=enp210s0f0np0   (PF)         |
>> >                   | ct_state(0x2a/0x3f)  <-- mask 0x3f   |
>> >                   | ct_mark(0/0x41)                      |
>> >                   | eth(src=*, dst=ae:ad:c9:2a:9d:0f)    |
>> >                   | ipv4(src=*, dst=172.27.61.7,         |
>> >                   |      proto=0/0, ttl=0/0)             |
>> >                   |                                      |
>> >                   | actions: enp210s0f0_1   (VF)         |
>> >                   |                                      |
>> >                   | pkts=41,192,879                      |
>> >                   | bytes=2,502,536,363,732              |
>> >                   | used=0.020s, flags=SFPR.             |
>> >                   |                                      |
>> >                   | dp:ovs   <-- STRANDED, NOT OFFLOADED |
>> >                   +--------------------------------------+
>> >
>> >
>> > -------------------------------------------------------------------------------
>> > 2. AFTER FLUSH (ovs-appctl dpctl/del-flows)
>> > -------------------------------------------------------------------------------
>> >
>> > After `ovs-appctl dpctl/del-flows` everything is re-installed in the
>> > natural pipeline order, so the chain check passes for every stage.
>> > The megaflow masks have not been re-aggregated yet, so we see a
>> > "fanned out" pipeline:
>> >
>> >    +----------------+    +----------------+    +----------------+
>> >    | recirc_id(0)   |    | recirc_id(0)   |    | (parent for    |
>> >    |    BRANCH A    |    |    BRANCH B    |    |  chain         |
>> >    | 5 sub-megaflows|    | 1 megaflow     |    |  0x3229d9 had  |
>> >    | vlan 120       |    | vlan 26        |    |  aged out at   |
>> >    | zone 6 NAT     |    | zone 5 NAT     |    |  dump time --  |
>> >    |                |    |                |    |  the two       |
>> >    | dst=           |    | dst=           |    |  stage-1       |
>> >    |  AA.BB.CC.DD   |    |  XX.YY.ZZ.AA   |    |  flows below   |
>> >    | by src/ttl:    |    | src=8.0.0.0/5  |    |  had pkts=0)   |
>> >    |  104/5 ttl=56  |    |  ttl=62        |    |                |
>> >    |  32/3  ttl=119 |    |                |    |  ufid:1b6d210e |
>> >    |  124/7 ttl=234 |    | pkts=14,326,765|    |  -- not        |
>> >    |  32/3  ttl=122 |    | bytes=128.7 GB |    |     captured   |
>> >    |  192/3 ttl=243 |    | used=0.660s    |    |     for branch |
>> >    |                |    |                |    |     C          |
>> >    | actions:       |    | actions:       |    |                |
>> >    |  pop_vlan,     |    |  pop_vlan,     |    |                |
>> >    |  ct(zone=6,    |    |  ct(zone=5,    |    |                |
>> >    |     nat),      |    |     nat),      |    |                |
>> >    |  recirc(       |    |  recirc(       |    |                |
>> >    |   0x320213)    |    |   0x321f17)    |    |                |
>> >    | offloaded:yes  |    | offloaded:yes  |    |                |
>> >    | dp:tc          |    | dp:tc          |    |                |
>> >    +----------------+    +----------------+    +----------------+
>> >             |                    |                       :
>> >             v                    v                       v
>> >    +----------------+    +----------------+    +----------------+
>> >    | recirc_id      |    | recirc_id      |    | recirc_id      |
>> >    |  (0x320213)    |    |  (0x321f17)    |    |  (0x3229d9)    |
>> >    |                |    |                |    |                |
>> >    | 3 sub-megaflows|    | 1 megaflow     |    | 2 megaflows    |
>> >    | ct_state(      |    | ct_state(      |    | ct_state(      |
>> >    |  0x2a/0x3e)    |    |  0x2a/0x3e)    |    |  0x21/0x3f)    |
>> >    | (+est+rpl+trk) |    | (+est+rpl+trk) |    | (+new+trk)     |
>> >    |                |    |                |    |                |
>> >    | ttl 119 -> 118 |    | ttl 62  -> 60  |    | ttl 234 -> 233 |
>> >    | ttl 56  -> 55  |    |                |    | ttl 243 -> 242 |
>> >    | ttl 122 -> 121 |    | pkts=14,326,690|    |                |
>> >    |                |    | bytes=128.7 GB |    | pkts=0  (new   |
>> >    | pkts=68+9+1=78 |    | used=0.660s    |    |   conn attempts|
>> >    |                |    |                |    |   in flight)   |
>> >    | actions:       |    | actions:       |    |                |
>> >    |  ct_clear,     |    |  ct_clear,     |    | actions:       |
>> >    |  set(eth src=  |    |  set(eth src=  |    |  (same shape   |
>> >    |   1a:83:..),   |    |   1a:83:..),   |    |   as branch    |
>> >    |  set(ipv4 ttl  |    |  set(ipv4 ttl  |    |   A/B stage 1) |
>> >    |   -1),         |    |   -1),         |    |  recirc(       |
>> >    |  ct(zone=11,   |    |  ct(zone=11,   |    |   0x314610)    |
>> >    |   nat),        |    |   nat),        |    |                |
>> >    |  recirc(       |    |  recirc(       |    | offloaded:yes  |
>> >    |   0x314610)    |    |   0x314610)    |    | dp:tc          |
>> >    | offloaded:yes  |    | offloaded:yes  |    |                |
>> >    | dp:tc          |    | dp:tc          |    |                |
>> >    +----------------+    +----------------+    +----------------+
>> >             |                    |                       |
>> >             +--------+           |          +------------+
>> >                      |           |          |
>> >                      v           v          v
>> >             +-----------------------------------------+
>> >             | recirc_id(0x314610)     STAGE 2         |
>> >             |                                         |
>> >             | Three flows now (all offloaded:yes,     |
>> >             | dp:tc):                                 |
>> >             |                                         |
>> >             | 1. ufid:c51ef89d   <-- the umbrella     |
>> >             |    ct_state(0x2a/0x3e)  <-- mask 0x3e   |
>> >             |    ct_mark(0/0x41)                      |
>> >             |    eth(src=*, dst=ae:ad:c9:2a:9d:0f)    |
>> >             |    ipv4(dst=172.27.61.7)                |
>> >             |    actions: enp210s0f0_1   (VF)         |
>> >             |    pkts=14,326,720                      |
>> >             |    bytes=128,674,265,194                |
>> >             |    used=0.660s                          |
>> >             |                                         |
>> >             | 2. ufid:d6f6c8c3   (DROP, new conn ACL) |
>> >             |    ct_state(0x21/0x3f)  (+new+trk)      |
>> >             |    eth(src=1a:83:58:7b:a8:ed,           |
>> >             |        dst=ae:ad:00:00:00:00/           |
>> >             |            ff:ff:00:00:00:00)           |
>> >             |    dst=172.27.60.0/23,                  |
>> >             |    tcp ports w/ submask                 |
>> >             |    actions: drop                        |
>> >             |    pkts=0                               |
>> >             |                                         |
>> >             | 3. ufid:0b52d8bd   (DROP, new conn ACL) |
>> >             |    same shape, different tcp submask    |
>> >             |    actions: drop                        |
>> >             |    pkts=0                               |
>> >             +-----------------------------------------+
>> >
>> >
>> > (Note:  The above ascii graph is generated by Claude)
>> >
>> >
>> > In the OVS logs we also see the below message
>> > (https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2363):
>> >
>> > ```
>> > 2026-06-01T21:15:33.774Z|10763|netdev_offload_tc(handler18)|DBG|
>> >   match for chain 3229200 failed due to non-existing goto chain action
>> > ```
>> >
>> > There seems to be a race condition around the ccmap 'used_chains'.
>> >
>> > As per Claude, the issue seems to have been introduced in commit
>> > `273a4fce951a` ("netdev-offload-tc: Only install recirc flows if the
>> > parent is present.") and there is a possible race window in the
>> > function netdev_tc_flow_put() between
>> > https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2695
>> > and
>> > https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2730
>> >
>> > @Eelco @Ilya - Do you have any idea on what could be going on here ?
>>
>> Hi Numan,
>>
>> Sorry for the late response, but this message ended up in my
>> spam box which I was cleaning up :( Put Ilya also on the TO
>> line, maybe it ended up in his spam also.
>>
>> I'm on PTO on Monday, so will try to take a look at this later
>> in the week.
>>
>> //Eelco
>>
>
>
> Hi Eelco,
>
> Thanks for the reply.
>
> Below are some of the details I found during my investigation.
>
> OVN Logical topology
> ----
>
> VM1 -> VPC logical switch (with ACLs) -> Logical router (NATs configured)
>     -> Public logical switch with localnet port -> br-int <-> patch ports
>     -> br-ex -> Physical NIC
>
> In this case, there is an iperf session between VM1 and an outside iperf
> server.
>
> When the reply traffic from the iperf server to VM1 enters the host via
> the physical NIC, the packet goes through 3 recirculation stages in total
> (stage 0, 1 and 2) before it is delivered to the VM, because of NAT in
> the logical router pipeline and ACLs in the logical switch pipeline:
> recirc 0, recirc 4 and recirc 5 (for example).
>
> When the datapath flow for recirc 0 is offloaded, recirc 4 is saved in
> "used_chains" (in tc_netdev_flow_put()), and when the dp flow with
> recirc 4 is offloaded, recirc 5 is saved in "used_chains".
>
>
> What I noticed is that some of the packets from the server to the VM
> have tcp.flags push/ack set and some have only ack set.
>
> ovs-vswitchd generates a different ufid in each case, but after flow
> translation the same datapath flow is generated.
>
> If I delete the dp flows while iperf is running (ovs-appctl
> dpctl/del-flows), a lot of packets get upcalled: one set of reply packets
> from the server has tcp.flags == push/ack and another set has just
> tcp.flags == ack.
>
> When the handler thread does the flow translation for packets with
> [recirc=5, tcp.flags == push/ack], it generates ufid 'A' and offloads
> the flow if '5' is present in 'used_chains'.
>
> When the handler thread does the flow translation for packets with
> [recirc=5, tcp.flags == ack], it generates ufid 'B', and when it tries to
> offload, tc returns EEXIST because ufid 'A' and ufid 'B' both generate
> the same datapath flow (the datapath flow has no match on tcp flags).
>
>
> And I see the error message 'match for chain 5 failed due to non-existing
> goto chain action'
> if the stage 1 flow (with recirc 4) was deleted.
>
> I was able to fix this issue by hacking ovn-northd and adding the below
> logical flow
>
> table=6 (ls_out_acl_eval    ), priority=65533,
>   match=(ct.est && !ct.rel && ct.rpl && ct_mark.blocked == 0 && tcp &&
>          (tcp.flags == 24 || tcp.flags == 16)),
>   action=(reg8[21] = ct_label.nf; reg8[16] = 1; next;)
> table=6 (ls_out_acl_eval    ), priority=65532,
>   match=(ct.est && !ct.rel && ct.rpl && ct_mark.blocked == 0),
>   action=(reg8[21] = ct_label.nf; reg8[16] = 1; next;)
>
> Priority 65532 is the existing logical flow which northd adds to allow all
> the established reply packets.
>
> Since the priority 65533 flow matches on tcp.flags, two distinct datapath
> flows are now installed, and I no longer see this race even when deleting
> the dp flows in a loop (with a sleep of 8 seconds between the deletes).
>
> Without the hack in northd,  I'm able to reproduce the issue 100% of the
> time within the first 3 dp flow deletes in a fake-multinode environment.
>
> When the issue is seen, the first 2 dp flows (recirc 0 and 4) are
> offloaded to tc and the last one is not.
>
>
> When 2 different ufids result in the same dp flow, perhaps the
> tc_netdev_flow_put() should store both ufids in the ufid_to_tc_mappings ?
>
> Let me know if you want me to provide a script to reproduce using
> ovn-fake-multinode.
>

Please ignore my hack theory.  I deployed a simple fake multinode
environment and started iperf between the fake VM "sw01p1" in
ovn-chassis-1 and the "ovnfake-ext1" namespace on the host.  I was able
to reproduce the issue even with the hack in place by deleting the dp
flows.

You can run the following command in ovn-chassis-1 and notice that the
dp flow changes from "dp:tc" to "dp:ovs":

  watch -n1 'ovs-appctl dpctl/dump-flows -m | grep "in_port(eth2"'
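
For anyone who wants to script the check instead of watching the output,
here is a rough sketch that groups the dumped flows on one port by
recirc_id and reports which ones are not in tc.  It assumes the
"recirc_id(...)", "in_port(...)" and "dp:tc" / "dp:ovs" fields look the
way they do in the dumps above.  The SAMPLE text is made up and heavily
abbreviated; it is not real output from this setup.

```python
import re
from collections import defaultdict

DP_RE = re.compile(r"\bdp:(\w+)")
RECIRC_RE = re.compile(r"recirc_id\((0x[0-9a-fA-F]+|\d+)\)")


def stranded_flows(dump_text, port="eth2"):
    """Group flows on `port` by recirc_id and list the ids not fully in tc."""
    by_recirc = defaultdict(list)
    for line in dump_text.splitlines():
        if f"in_port({port}" not in line:
            continue
        dp = DP_RE.search(line)
        recirc = RECIRC_RE.search(line)
        if dp and recirc:
            by_recirc[recirc.group(1)].append(dp.group(1))
    stranded = sorted(r for r, dps in by_recirc.items()
                      if any(d != "tc" for d in dps))
    return dict(by_recirc), stranded


# Hypothetical, abbreviated dump lines (illustration only).
SAMPLE = """\
ufid:aaaa, recirc_id(0),in_port(eth2), used:0.1s, offloaded:yes, dp:tc, actions:ct(zone=5,nat),recirc(0x4)
ufid:bbbb, recirc_id(0x4),in_port(eth2), used:0.1s, offloaded:yes, dp:tc, actions:ct(zone=11,nat),recirc(0x5)
ufid:cccc, recirc_id(0x5),in_port(eth2), used:0.1s, dp:ovs, actions:3
ufid:dddd, recirc_id(0),in_port(eth9), used:0.1s, dp:ovs, actions:drop
"""

flows, stranded = stranded_flows(SAMPLE)
print(flows)
print(stranded)
```

To use it against a live system, feed it the text captured from
`ovs-appctl dpctl/dump-flows -m` instead of SAMPLE.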


Below is my theory on why offload fails in our prod deployment:

  T1.  Packet P_A arrives (upcalled), hashed to ufid:A.
       xlate produces megaflow M1 with action set ending in
       recirc(R1).
       R1 = recirc_alloc_id_ctx(frozen-alpha).
       tc_netdev_flow_put(M1) fails (any reason -- unsupported
       action, EOPNOTSUPP from a probe, mlx5 driver capability
       mismatch, etc.).
       M1 lands in dp:ovs via dpif_netlink fallback.
       ccmap_inc(used_chains, R1) NEVER fires.

  T2.  P_A continues through openvswitch.ko, internal recirc to
       R1, miss, upcall.
       Hashed to ufid:B.
       xlate produces megaflow M2 matching on recirc_id=R1, with
       some output action (e.g. mirred to a port).
       tc_netdev_flow_put(M2):
         chain check: ccmap_find(used_chains, R1) == 0  ->  BAIL
         return EOPNOTSUPP.
       M2 lands in dp:ovs.

  T3.  A different packet P_C arrives, hashed to ufid:C.
       xlate produces megaflow M3 with a DIFFERENT match key but
       arrives at the same OF freeze point as M1's translation did.
       Same frozen_state -> same R1.
       M3 install SUCCEEDS in TC.
       ccmap_inc(used_chains, R1)  ->  count = 1.

  T4.  P_C continues, TC recircs to R1.
       TC chain R1 is empty (M2 is in dp:ovs, not TC).
       Packet falls to openvswitch.ko at recirc_id=R1.
       M2 matches; executes in SW.

Net result: M3 (upstream sibling) is in dp:tc, M2 (downstream) is
permanently in dp:ovs.  P_C traffic does TC-fast-path on the
upstream stage but kernel-SW on the downstream stage, every packet.
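
The sequence above can be replayed with a toy model.  This is only a
sketch of the theory as I described it, not the OVS code: ToyOffload,
flow_put(), goto and hw_ok are made-up names, the Counter stands in for
the 'used_chains' ccmap, and R1 = 5 is an arbitrary recirc id.

```python
from collections import Counter


class ToyOffload:
    """Toy model of the parent-chain check described in T1-T4."""

    def __init__(self):
        self.used_chains = Counter()  # stands in for the ccmap 'used_chains'
        self.placement = {}           # megaflow name -> "tc" or "ovs"

    def flow_put(self, name, recirc_id, goto=None, hw_ok=True):
        # A flow on chain N is only offloaded if some offloaded flow
        # already jumps to N (chain 0 needs no parent).
        parent_present = recirc_id == 0 or self.used_chains[recirc_id] > 0
        if hw_ok and parent_present:
            if goto is not None:
                self.used_chains[goto] += 1
            self.placement[name] = "tc"
        else:
            # Fallback to the kernel datapath; used_chains is not touched.
            self.placement[name] = "ovs"
        return self.placement[name]


R1 = 5  # arbitrary recirc id for the illustration
dp = ToyOffload()
dp.flow_put("M1", recirc_id=0, goto=R1, hw_ok=False)  # T1: tc install fails
dp.flow_put("M2", recirc_id=R1)                       # T2: parent check bails
dp.flow_put("M3", recirc_id=0, goto=R1)               # T3: sibling succeeds
print(dp.placement, dp.used_chains[R1])
```

In this model M2 stays in "ovs" after T3 even though used_chains[R1] is
now 1, because nothing calls flow_put() for M2 again.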

Does this theory make sense ?

Thanks
Numan



> Thanks
> Numan
>
>
>>
>> > Let me know if you need more information.  I'll try to debug further.
>> >
>> > Thanks
>> > Numan
>>
>>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
