On Tue, Dec 30, 2025 at 06:49:27PM +0800, LIU Yulong wrote:
> Our running ovs-vswitchd was cored with stack [1]. It shows a
> concurrent race condition.
>
> When the revalidator thread is performing rule revalidation/translation,
> another thread has already deleted the rule tag and reduced the reference
> count to 0. Using ofproto_rule_ref() at this point will trigger an assertion,
> as it assumes the object is still alive (refcount > 0).
>
> Replace the mandatory reference to the rule in xlate_table_action()
> with a non-fatal try-ref: ofproto_rule_try_ref().
> If the reference fails (the rule is being/has been released),
> treat it as a table miss path, avoiding referencing "dead" rules
> and causing assertion errors.
>
> [1] call stack:
> *0  raise () from /lib64/libc.so.6
> *1  abort () from /lib64/libc.so.6
> *2  ovs_abort_valist () at lib/util.c:499
> *3  vlog_abort_valist () at lib/vlog.c:1249
> *4  vlog_abort () at lib/vlog.c:1263
> *5  ovs_assert_failure () at lib/util.c:86
> *6  ovs_refcount_ref () at lib/ovs-atomic.h:547
> *7  ovs_refcount_ref () at ofproto/ofproto.c:3010
> *8  ofproto_rule_ref () at ofproto/ofproto.c:3012
> *9  xlate_table_action () at ofproto/ofproto-dpif-xlate.c:4506
> *10 do_xlate_actions () at ofproto/ofproto-dpif-xlate.c:7183
> *11 xlate_recursively () at ofproto/ofproto-dpif-xlate.c:4382
> *12 xlate_table_action () at ofproto/ofproto-dpif-xlate.c:4511
> *13 do_xlate_actions () at ofproto/ofproto-dpif-xlate.c:7183
> *14 clone_xlate_actions () at ofproto/ofproto-dpif-xlate.c:5809
> *15 xlate_recursively () at ofproto/ofproto-dpif-xlate.c:4382
> *16 xlate_table_action () at ofproto/ofproto-dpif-xlate.c:4511
> *17 patch_port_output () at ofproto/ofproto-dpif-xlate.c:3889
> *18 compose_output_action__ () at ofproto/ofproto-dpif-xlate.c:4204
> *19 compose_output_action () at ofproto/ofproto-dpif-xlate.c:4359
> *20 xlate_output_action () at ofproto/ofproto-dpif-xlate.c:5304
> *21 do_xlate_actions () at ofproto/ofproto-dpif-xlate.c:7033
> *22 xlate_actions () at ofproto/ofproto-dpif-xlate.c:8001
> *23 xlate_key () at ofproto/ofproto-dpif-upcall.c:2321
> *24 xlate_ukey () at ofproto/ofproto-dpif-upcall.c:2336
> *25 revalidate_ukey__ () at ofproto/ofproto-dpif-upcall.c:2382
> *26 revalidate_ukey () at ofproto/ofproto-dpif-upcall.c:2491
> *27 revalidate () at ofproto/ofproto-dpif-upcall.c:2945
> *28 udpif_revalidator () at ofproto/ofproto-dpif-upcall.c:1089
> *29 ovsthread_wrapper () at lib/ovs-thread.c:422
> *30 start_thread () from /lib64/libpthread.so.0
> *31 clone () from /lib64/libc.so.6
>
> Signed-off-by: LIU Yulong <[email protected]>

nit: As you point out, this is the second version of the patch, the
subject should reflect that by saying "[PATCH v2]"

> ---
>  ofproto/ofproto-dpif-xlate.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 920d998e6..b0d42d29d 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -4793,7 +4793,7 @@ xlate_table_action(struct xlate_ctx *ctx, ofp_port_t 
> in_port, uint8_t table_id,
>              tuple_swap(&ctx->xin->flow, ctx->wc);
>          }
>
> -        if (rule) {
> +        if (rule && ofproto_rule_try_ref(&rule->up)) {
>              /* Fill in the cache entry here instead of xlate_recursively
>               * to make the reference counting more explicit.  We take a
>               * reference in the lookups above if we are going to cache the
> @@ -4804,6 +4804,8 @@ xlate_table_action(struct xlate_ctx *ctx, ofp_port_t 
> in_port, uint8_t table_id,
>                  entry = xlate_cache_add_entry(ctx->xin->xcache, XC_RULE);
>                  entry->rule = rule;
>                  ofproto_rule_ref(&rule->up);
> +            } else {
> +                ofproto_rule_unref(&rule->up);

AFAICS, this would also leak a reference:
ofproto_rule_try_ref() above would increase the reference count.
If the xcache exists, we increase it again and not release it (else
clause not executed). So we end up having increased the references by 2.

One of the refereces will be dropped by the xcache when it releases the
entry but the other one is leaked.

I think the refcount should be released unconditionally, i.e:

diff --git i/ofproto/ofproto-dpif-xlate.c w/ofproto/ofproto-dpif-xlate.c
index b0d42d29d..577ddec1e 100644
--- i/ofproto/ofproto-dpif-xlate.c
+++ w/ofproto/ofproto-dpif-xlate.c
@@ -4804,9 +4804,8 @@ xlate_table_action(struct xlate_ctx *ctx,
ofp_port_t in_port, uint8_t table_id,
                 entry = xlate_cache_add_entry(ctx->xin->xcache, XC_RULE);
                 entry->rule = rule;
                 ofproto_rule_ref(&rule->up);
-            } else {
-                ofproto_rule_unref(&rule->up);
             }
+            ofproto_rule_unref(&rule->up);

             struct ovs_list *old_trace = ctx->xin->trace;
             xlate_report_table(ctx, rule, table_id);


Also, xlate_actions() has a piece of code pretty similar to this. Does
it need fixing as well?

Thanks.
Adrián

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to