On 5/7/26 4:31 PM, Dumitru Ceara wrote:
> Hi all,
> 
> It seems that the test added by this patch of mine sometimes fails in
> GitHub CI when ovs-vswitchd is stopped; the failure is due to:
> 
> ./ovn.at:10193: check_logs "
>         $error
>         /connection failed (No such file or directory)/d
>         /has no network name*/d
>         /receive tunnel port not found*/d
>         /Failed to locate tunnel to reach main chassis/d
>         /Transaction causes multiple rows.*MAC_Binding/d
>         /Transaction causes multiple rows.*FDB/d
>     " $sbox
> --- /dev/null 2026-05-06 14:47:18.250105001 +0000
> +++ /workspace/ovn-tmp/tests/testsuite.dir/at-groups/155/stdout
> 2026-05-06 14:54:35.770811381 +0000
> @@ -0,0 +1,2 @@
> +2026-05-06T14:54:35.548Z|00471|ofproto_dpif_rid|ERR|recirc_id 4 left
> allocated when ofproto (br-int) is destructed
> +2026-05-06T14:54:35.548Z|00472|ofproto_dpif_rid|ERR|recirc_id 2 left
> allocated when ofproto (br-int) is destructed
> 
> https://github.com/ovsrobot/ovn/actions/runs/25442425151/job/74637759422#step:12:5664
> 
> I'm failing to reproduce the issue locally but I'll keep investigating.
> 

+Eelco as we were discussing offline about this earlier.

So, if I understand correctly, it's just a race when ovs-vswitchd exits:

ofproto-dpif.c:destruct() is called, it removes all OF rules and then
calls recirc_free_ofproto() which assumes there's no remaining recirc_id.

However, close_dpif_backer() which in turn calls udpif_destroy() and
deletes all the ukeys (including recirc_ids they might have) is only
called at the end of destruct().

static void
destruct(struct ofproto *ofproto_, bool del)
{
...
    ofproto->backer->need_revalidate = REV_RECONFIGURE;
    xlate_txn_start();
    xlate_remove_ofproto(ofproto);
    xlate_txn_commit();

    hmap_remove(&all_ofproto_dpifs_by_name,
                &ofproto->all_ofproto_dpifs_by_name_node);

    OFPROTO_FOR_EACH_TABLE (table, &ofproto->up) {
        CLS_FOR_EACH (rule, up.cr, &table->cls) {
            ofproto_rule_delete(&ofproto->up, &rule->up);
        }
    }
...

    recirc_free_ofproto(ofproto, ofproto->up.name);
...
    /* Wait for all the meter destroy work to finish. */
    ovsrcu_barrier();
    close_dpif_backer(ofproto->backer, del);
}

Is this something that's worth addressing in OVS?  I understand it's
benign because we're exiting anyway but I am a bit surprised we don't
hit it more often in tests (OVN tests too).

Alternatively, would it be an option to change OVN's OVN_CLEANUP_VSWITCH
macro to call revalidator/purge before stopping vswitchd?  I.e.:

diff --git a/tests/ovn-macros.at b/tests/ovn-macros.at
index 39f03ba62b..b0b22e2fe6 100644
--- a/tests/ovn-macros.at
+++ b/tests/ovn-macros.at
@@ -223,6 +223,7 @@ m4_define([OVN_CLEANUP_VSWITCH],[
     echo
     echo "$1: clean up vswitch"
     as $1
+    ovs-appctl revalidator/purge
     OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
     OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 ])

Thanks,
Dumitru

> Regards,
> Dumitru
> 
> On 5/6/26 1:49 PM, Dumitru Ceara wrote:
>> On 5/6/26 10:33 AM, Mairtin O'Loingsigh wrote:
>>> On Mon, May 04, 2026 at 09:05:40AM +0200, Dumitru Ceara wrote:
>>>> Hi Mairtin,
>>>>
>>>> Thanks for the review!
>>>>
>>>> On 4/29/26 10:28 AM, Mairtin O'Loingsigh wrote:
>>>>> On Fri, Apr 24, 2026 at 05:35:58PM +0200, Dumitru Ceara via dev wrote:
>>>>>> The ARP/ND responder stage (ls_in_arp_rsp) unconditionally
>>>>>> bypassed all traffic arriving from localnet ports via a
>>>>>> priority-100 "next;" flow.  This caused broadcast ARP/ND
>>>>>> requests from the physical network to be flooded to every
>>>>>> logical switch port instead of being handled by proxy
>>>>>> ARP/ND.  On switches with ~200+ ports the resulting
>>>>>> multicast replication exceeded the OVS 4K resubmit limit,
>>>>>> dropping the packets and breaking connectivity.
>>>>>>
>>>>>> Replace the bypass with a targeted mechanism:
>>>>>>
>>>>>>   - In ls_in_lookup_fdb, set flags.localnet = 1 for
>>>>>>     packets arriving from localnet ports (P50 fallback;
>>>>>>     the existing P100 FDB-learning flow already sets this
>>>>>>     flag when FDB learning is enabled).
>>>>>>
>>>>>>   - In the P50 ARP/ND reply flows, append the condition
>>>>>>     "((flags.localnet == 1 && is_chassis_resident(port))
>>>>>>      || flags.localnet == 0)" on switches that have
>>>>>>     localnet ports.
>>>>>>
>>>>>> This ensures that ARP/ND requests from localnet are only
>>>>>> answered on the chassis hosting the target VIF, preventing
>>>>>> both the flood and duplicate replies from multiple
>>>>>> hypervisors.  VIF-to-VIF proxy ARP/ND is unchanged because
>>>>>> flags.localnet is 0 for non-localnet-sourced traffic.
>>>>>>
>>>>>> Fixes: f763a3273b84 ("ovn: Avoid ARP responder for packets from localnet 
>>>>>> port")
>>>>>> Reported-at: https://redhat.atlassian.net/browse/FDP-3436
>>>>>> Assisted-by: Claude Opus 4.6, Claude Code
>>>>>> Signed-off-by: Dumitru Ceara <[email protected]>
>>>>>> ---
>>>>
>>>> [...]
>>>>
>>>>>>  
>>>>>> +/* On switches with localnet ports, restrict ARP/ND replies for
>>>>>> + * localnet-sourced requests to the chassis hosting the target VIF
>>>>>> + * (preventing duplicate replies from every hypervisor).  Non-localnet
>>>>>> + * requests (VIF-to-VIF) are answered unconditionally as before. */
>>>>>> +static void
>>>>>> +build_lswitch_arp_nd_local_resp_match(struct ds *match,
>>>>>> +                                      const struct ovn_port *op)
>>>>>> +{
>>>>>> +    if (!ls_has_localnet_port(op->od)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    ds_put_format(match,
>>>>>> +        " && ((flags.localnet == 1 && is_chassis_resident(%s))"
>>>>>> +            " || flags.localnet == 0)", op->json_key);
>>>>> nit: spacing
>>>>
>>>> I had actually done this on purpose to make it a bit more visible that "
>>>> || flags.localnet == 0" is part of the condition in parenthesis.  But I
>>>> have no strong preference in the end.  Please let me know if you still
>>>> would like me to change it.
>>>>
>>>>>> +}
>>>>>> +
>>>>
>>>> [...]
>>>>
>>>>>>
>>>>> LGTM. Just one small nit.
>>>>>
>>>>> Acked-by: Mairtin O'Loingsigh <[email protected]>
>>>>>
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>
>>> This spacing does look more readable. No need to change.
>>>
>>
>> Hi Mairtin,
>>
>> Thanks for the confirmation!  Applied to main and all stable branches
>> down to 24.03.
>>
>> Regards,
>> Dumitru
>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to