On 1/14/26 7:32 AM, LIU Yulong wrote:
> Our running ovs-vswitchd was cored with stack [1]. It shows a
> concurrent race condition.
>
> When the revalidator thread is performing rule revalidation/translation,
> another thread has already deleted the rule tag and reduced the reference
> count to 0. Using ofproto_rule_ref() at this point will trigger an assertion,
> as it assumes the object is still alive (refcount > 0).
>
> Replace the mandatory reference to the rule in xlate_table_action()
> with a non-fatal try-ref: ofproto_rule_try_ref().
> If the reference fails (the rule is being/has been released),
> treat it as a table miss path, avoiding referencing "dead" rules
> and causing assertion errors.
>
> [1] call stack:
> *0 raise () from /lib64/libc.so.6
> *1 abort () from /lib64/libc.so.6
> *2 ovs_abort_valist () at lib/util.c:499
> *3 vlog_abort_valist () at lib/vlog.c:1249
> *4 vlog_abort () at lib/vlog.c:1263
> *5 ovs_assert_failure () at lib/util.c:86
> *6 ovs_refcount_ref () at lib/ovs-atomic.h:547
> *7 ovs_refcount_ref () at ofproto/ofproto.c:3010
> *8 ofproto_rule_ref () at ofproto/ofproto.c:3012
> *9 xlate_table_action () at ofproto/ofproto-dpif-xlate.c:4506
> *10 do_xlate_actions () at ofproto/ofproto-dpif-xlate.c:7183
> *11 xlate_recursively () at ofproto/ofproto-dpif-xlate.c:4382
> *12 xlate_table_action () at ofproto/ofproto-dpif-xlate.c:4511
> *13 do_xlate_actions () at ofproto/ofproto-dpif-xlate.c:7183
> *14 clone_xlate_actions () at ofproto/ofproto-dpif-xlate.c:5809
> *15 xlate_recursively () at ofproto/ofproto-dpif-xlate.c:4382
> *16 xlate_table_action () at ofproto/ofproto-dpif-xlate.c:4511
> *17 patch_port_output () at ofproto/ofproto-dpif-xlate.c:3889
> *18 compose_output_action__ () at ofproto/ofproto-dpif-xlate.c:4204
> *19 compose_output_action () at ofproto/ofproto-dpif-xlate.c:4359
> *20 xlate_output_action () at ofproto/ofproto-dpif-xlate.c:5304
> *21 do_xlate_actions () at ofproto/ofproto-dpif-xlate.c:7033
> *22 xlate_actions () at ofproto/ofproto-dpif-xlate.c:8001
> *23 xlate_key () at ofproto/ofproto-dpif-upcall.c:2321
> *24 xlate_ukey () at ofproto/ofproto-dpif-upcall.c:2336
> *25 revalidate_ukey__ () at ofproto/ofproto-dpif-upcall.c:2382
> *26 revalidate_ukey () at ofproto/ofproto-dpif-upcall.c:2491
> *27 revalidate () at ofproto/ofproto-dpif-upcall.c:2945
> *28 udpif_revalidator () at ofproto/ofproto-dpif-upcall.c:1089
> *29 ovsthread_wrapper () at lib/ovs-thread.c:422
> *30 start_thread () from /lib64/libpthread.so.0
> *31 clone () from /lib64/libc.so.6
>
> Signed-off-by: LIU Yulong <[email protected]>
> ---
> V4:
> - Address ci and clang-analyze failure
> V3:
> - Addressed Addressed Ilya's and Adrián's comments:
> - release the reference unconditionally
> - Addressed Adrián's comments:
> - make code consistent in xlate_actions()
> V2:
> - Addressed Eelco's comments:
> - release the reference after ofproto_rule_try_ref()
> - Addressed Ilya's comments:
> - remove the commit message function args and mem addresses
> ---
> ofproto/ofproto-dpif-xlate.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 920d998e6..d21f7f003 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -4793,7 +4793,7 @@ xlate_table_action(struct xlate_ctx *ctx, ofp_port_t
> in_port, uint8_t table_id,
> tuple_swap(&ctx->xin->flow, ctx->wc);
> }
>
> - if (rule) {
> + if (rule && ofproto_rule_try_ref(&rule->up)) {
This still doesn't allow to use the rule when we're not caching.
As I said in my review for the very first version of the patch, if
we're not going to cache, we should not attempt to take the reference
and we should still proceed flow translation with the rule, as it
should be around for the current grace period. It would look closer
to how the first version of this patch looked like:
if (rule && ctx->xin->xcache && !ofproto_rule_try_ref(&rule->up)) {
rule = NULL;
}
Same for the other part inside the xlate_actions() function.
However, it doesn't seem like the right solution in general. Reading
the original change from 2015 that changed the code to take references
directly:
1e1e1d19e6bb ("ofproto-dpif: Use a regular ref instead of try_ref for rule
translation.")
The logic in there still holds. For the issue to happen we need the
classifier to have the rule in it during the current grace period, but
also somehow the reference count should go down to zero within the same
RCU grace period. That should not be possible, as the reference count
decrease is on itself is RCU-postponed. So, after removal from the
classifier, the rule must still have the non-zero reference until the
next RCU grace period. It means that if we were able to look this rule
up, it must have a non-zero reference count. And so we should not need
to use the try_ref functions.
It looks like this crash is just a symptom of the RCU management going
wrong somewhere else, and the fact of taking the reference in the flow
translation code is not the root cause of the issue and we shouldn't
try to hide it by using a try_ref.
Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev