On 10/21/25 11:07 AM, Eelco Chaudron via dev wrote:
> 
> 
> On 21 Oct 2025, at 10:46, LIU Yulong wrote:
> 
>> Our running ovs-vswitchd was cored with stack [1]. It shows a
>> concurrent race condition.
>>
>> When the revalidator thread is performing rule revalidation/translation,
>> another thread has already deleted the rule tag and reduced the reference
>> count to 0. Using ofproto_rule_ref() at this point will trigger an assertion,
>> as it assumes the object is still alive (refcount > 0).
>>
>> Replace the mandatory reference to the rule in xlate_table_action()
>> with a non-fatal try-ref: ofproto_rule_try_ref().
>> If the reference fails (the rule is being/has been released),
>> treat it as a table miss path, avoiding referencing "dead" rules
>> and causing assertion errors.
>>
>> [1] call stack:
>> *0  0x00007f2e2950c337 in raise () from /lib64/libc.so.6
>> *1  0x00007f2e2950da28 in abort () from /lib64/libc.so.6
>> *2  0x00005581a8eb3c4e in ovs_abort_valist (err_no=err_no@entry=0,
>>     format=format@entry=0x5581a93191d0 "%s: assertion %s failed in %s()",
>>     args=args@entry=0x7f2e17327700) at lib/util.c:499
>> *3  0x00005581a8ebb7a0 in vlog_abort_valist (module_=<optimized out>,
>>     message=0x5581a93191d0 "%s: assertion %s failed in %s()",
>>     args=args@entry=0x7f2e17327700) at lib/vlog.c:1249
>> *4  0x00005581a8ebb834 in vlog_abort (module=module@entry=0x5581a9888960 
>> <this_module>,
>>     message=message@entry=0x5581a93191d0 "%s: assertion %s failed in %s()") 
>> at lib/vlog.c:1263
>> *5  0x00005581a8eb386c in ovs_assert_failure 
>> (where=where@entry=0x5581a92e95e8 "./lib/ovs-atomic.h:547",
>>     function=function@entry=0x5581a92eb7a0 <__func__.6720> 
>> "ovs_refcount_ref",
>>     condition=condition@entry=0x5581a92e95d7 "old_refcount > 0") at 
>> lib/util.c:86
>> *6  0x00005581a2b33dd1 in ovs_refcount_ref (refcount=0x5581aca1ef28) at 
>> lib/ovs-atomic.h:547
>> *7  0x00005581a8d82afb in ovs_refcount_ref (refcount=0x5581aca1ef28) at 
>> ofproto/ofproto.c:3010
>> *8  ofproto_rule_ref (rule=rule@entry=0x5581aca1eeb0) at 
>> ofproto/ofproto.c:3012
>> *9  0x00005581a8dacfbd in xlate_table_action (ctx=0x7f2e1732bcb0, 
>> in_port=<optimized out>,
>>     table_id=<optimized out>, may_packet_in=<optimized out>, 
>> honor_table_miss=<optimized out>,
>>     with_ct_orig=<optimized out>, is_last_action=true,
>>     xlator=0x5581a8daf850 <do_xlate_actions>) at 
>> ofproto/ofproto-dpif-xlate.c:4506
>> *10 0x00005581a8db0fdc in do_xlate_actions (ofpacts=<optimized out>, 
>> ofpacts_len=<optimized out>,
>>     ctx=0x7f2e1732bcb0, is_last_action=<optimized out>,
>>     group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7183
>> *11 0x00005581a8dad045 in xlate_recursively (actions_xlator=0x5581a8daf850 
>> <do_xlate_actions>,
>>     is_last_action=true, deepens=false, rule=0x5581ac88e440, 
>> ctx=0x7f2e1732bcb0) at ofproto/ofproto-dpif-xlate.c:4382
>> *12 xlate_table_action (ctx=0x7f2e1732bcb0, in_port=<optimized out>, 
>> table_id=<optimized out>,
>>     may_packet_in=<optimized out>, honor_table_miss=<optimized out>,
>>     with_ct_orig=<optimized out>, is_last_action=true,
>>     xlator=0x5581a8daf850 <do_xlate_actions>) at 
>> ofproto/ofproto-dpif-xlate.c:4511
>> *13 0x00005581a8db0fdc in do_xlate_actions 
>> (ofpacts=ofpacts@entry=0x5581ac8912d8,
>>     ofpacts_len=ofpacts_len@entry=8, ctx=ctx@entry=0x7f2e1732bcb0,
>>     is_last_action=is_last_action@entry=true,
>>     group_bucket_action=group_bucket_action@entry=false) at 
>> ofproto/ofproto-dpif-xlate.c:7183
>> *14 0x00005581a8db74a8 in clone_xlate_actions (actions=0x5581ac8912d8, 
>> actions_len=8,
>>     ctx=0x7f2e1732bcb0, is_last_action=<optimized out>,
>>     group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:5809
>> *15 0x00005581a8dad045 in xlate_recursively (actions_xlator=0x5581a8db72b0 
>> <clone_xlate_actions>,
>>     is_last_action=true, deepens=true, rule=0x5581ac88d990,
>>     ctx=0x7f2e1732bcb0) at ofproto/ofproto-dpif-xlate.c:4382
>> *16 xlate_table_action (ctx=0x7f2e1732bcb0, in_port=<optimized out>, 
>> table_id=<optimized out>,
>>     may_packet_in=<optimized out>, honor_table_miss=<optimized out>,
>>     with_ct_orig=<optimized out>, is_last_action=true,
>>     xlator=0x5581a8db72b0 <clone_xlate_actions>) at 
>> ofproto/ofproto-dpif-xlate.c:4511
>> *17 0x00005581a8db3470 in patch_port_output (ctx=ctx@entry=0x7f2e1732bcb0,
>>     out_dev=0x5581acbef790, is_last_action=is_last_action@entry=true,
>>     in_dev=0x5581acce4210, in_dev=0x5581acce4210) at 
>> ofproto/ofproto-dpif-xlate.c:3889
>> *18 0x00005581a8db3847 in compose_output_action__ 
>> (ctx=ctx@entry=0x7f2e1732bcb0,
>>     ofp_port=2, xr=xr@entry=0x0, check_stp=check_stp@entry=true,
>>     is_last_action=is_last_action@entry=true, truncate=truncate@entry=false)
>>     at ofproto/ofproto-dpif-xlate.c:4204
>> *19 0x00005581a8db6310 in compose_output_action (truncate=false,
>>     is_last_action=true, xr=0x0, ofp_port=<optimized out>,
>>     ctx=0x7f2e1732bcb0) at ofproto/ofproto-dpif-xlate.c:4359
>> *20 xlate_output_action (ctx=ctx@entry=0x7f2e1732bcb0,
>>     port=<optimized out>, controller_len=<optimized out>,
>>     may_packet_in=may_packet_in@entry=true, 
>> is_last_action=is_last_action@entry=true,
>>     truncate=truncate@entry=false,
>>     group_bucket_action=group_bucket_action@entry=false) at 
>> ofproto/ofproto-dpif-xlate.c:5304
>> *21 0x00005581a8dafae4 in do_xlate_actions 
>> (ofpacts=ofpacts@entry=0x5581ac64a918,
>>     ofpacts_len=ofpacts_len@entry=48, ctx=ctx@entry=0x7f2e1732bcb0,
>>     is_last_action=is_last_action@entry=true,
>>     group_bucket_action=group_bucket_action@entry=false) at 
>> ofproto/ofproto-dpif-xlate.c:7033
>> *22 0x00005581a8db9616 in xlate_actions (xin=xin@entry=0x7f2e1732c570,
>>     xout=xout@entry=0x7f2e1732cea0) at ofproto/ofproto-dpif-xlate.c:8001
>> *23 0x00005581a8da54a0 in xlate_key (key=<optimized out>, len=<optimized 
>> out>,
>>     push=push@entry=0x7f2e1732c920, ctx=ctx@entry=0x7f2e1732ce80,
>>     udpif=<optimized out>) at ofproto/ofproto-dpif-upcall.c:2321
>> *24 0x00005581a8da5b54 in xlate_ukey (ukey=0x7f2ddea72f40,
>>     ukey=0x7f2ddea72f40, ctx=0x7f2e1732ce80, tcp_flags=<optimized out>,
>>     udpif=0x5581ac626e00) at ofproto/ofproto-dpif-upcall.c:2336
>> *25 revalidate_ukey__ (udpif=udpif@entry=0x5581ac626e00,
>>     ukey=ukey@entry=0x7f2ddea72f40, tcp_flags=<optimized out>,
>>     odp_actions=0x7f2e1732d340, recircs=recircs@entry=0x7f2e1732d310,
>>     xcache=<optimized out>) at ofproto/ofproto-dpif-upcall.c:2382
>> *26 0x00005581a8da5dd1 in revalidate_ukey (udpif=udpif@entry=0x5581ac626e00,
>>     ukey=ukey@entry=0x7f2ddea72f40, stats=stats@entry=0x7f2e1732d320,
>>     odp_actions=odp_actions@entry=0x7f2e1732d340,
>>     reval_seq=reval_seq@entry=3127840410,
>>     recircs=recircs@entry=0x7f2e1732d310) at 
>> ofproto/ofproto-dpif-upcall.c:2491
>> *27 0x00005581a8daa1d2 in revalidate 
>> (revalidator=revalidator@entry=0x5581ac672690) at 
>> ofproto/ofproto-dpif-upcall.c:2945
>> *28 0x00005581a8daa514 in udpif_revalidator (arg=0x5581ac672690) at 
>> ofproto/ofproto-dpif-upcall.c:1089
>> *29 0x00005581a8e7f61f in ovsthread_wrapper (aux_=<optimized out>) at 
>> lib/ovs-thread.c:422
>> *30 0x00007f2e2b662e65 in start_thread () from /lib64/libpthread.so.0
>> *31 0x00007f2e295d488d in clone () from /lib64/libc.so.6

nit: the trace can be cleaned up, so it's easier to read.  We do not
need any actual memory addresses (at least, the function addresses
are not needed) and most of the function arguments.  Most important
here and function names and positions in files.

>>
>> Signed-off-by: LIU Yulong <[email protected]>
> 
> Thanks for the patch, see comment below.
> 
>> ---
>>  ofproto/ofproto-dpif-xlate.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
>> index 2c8197fb7..cc8c75b90 100644
>> --- a/ofproto/ofproto-dpif-xlate.c
>> +++ b/ofproto/ofproto-dpif-xlate.c
>> @@ -4793,6 +4793,10 @@ xlate_table_action(struct xlate_ctx *ctx, ofp_port_t 
>> in_port, uint8_t table_id,
>>              tuple_swap(&ctx->xin->flow, ctx->wc);
>>          }
>>
>> +        if (rule && !ofproto_rule_try_ref(&rule->up)) {
>> +            rule = NULL;
>> +        }
> 
> The call above takes a reference to the rule, but that reference is never 
> released. As a result, the rules will never be freed, causing a memory leak.
> 
> You might skip taking the reference below, or in the case you do not need it, 
> release it.
> 
> Something like this (not tested at all):
> 
> ++++          if (rule && ofproto_rule_try_ref(&rule->up)) {
> 4797              /* Fill in the cache entry here instead of xlate_recursively
> 4798               * to make the reference counting more explicit.  We take a
> 4799               * reference in the lookups above if we are going to cache 
> the
> 4800               * rule. */
> 4801              if (ctx->xin->xcache) {
> 4802                  struct xc_entry *entry;
> 4803
> 4804                  entry = xlate_cache_add_entry(ctx->xin->xcache, 
> XC_RULE);
> 4805                  entry->rule = rule;
> ----                  ofproto_rule_ref(&rule->up);
> ++++              } else {
> ++++                  ofproto_rule_unref(&rule->up);
> ++++              }
> 4808
> 4809              struct ovs_list *old_trace = ctx->xin->trace;
> 4810              xlate_report_table(ctx, rule, table_id);
> 

One more thing here is that we should still allow the use of this rule
even if we can't take the reference, unless we need to cache it.
If we don't need to cache it, there is no need to take the reference,
and the rule should be valid until the next RCU quiescent period.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to