Maybe we should use the patch I uploaded 19 months ago:
https://mail.openvswitch.org/pipermail/ovs-dev/2024-March/412491.html


It did solve the issue for about 2 years.


 
------------------ Original ------------------
From: "LIU Yulong" <[email protected]>;
Date: Wed, Oct 22, 2025 06:03 PM
To: "liuyulong" <[email protected]>; "Eelco Chaudron" <[email protected]>;
Cc: "dev" <[email protected]>;
Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add ovsrcu_postpone for ukey_delete__.

Updates:
2. The change to `upcall_receive` did not solve the issue; ovs-vswitchd still dumps core:
#0  0x00007f7bf21ca337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f7bf21cba28 in __GI_abort () at abort.c:90
#2  0x000055811c97cc6e in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7f7bdfffa360) at lib/util.c:499
#3  0x000055811c97cd04 in ovs_abort (err_no=err_no@entry=0, format=format@entry=0x55811cddaec0 "%s: %s() passed uninitialized ovs_mutex") at lib/util.c:491
#4  0x000055811c947a81 in ovs_mutex_trylock_at (l_=l_@entry=0x7f7bc9a751f8, where=where@entry=0x55811cdb7e78 "ofproto/ofproto-dpif-upcall.c:3027") at lib/ovs-thread.c:106
#5  0x000055811c86f4f1 in revalidator_sweep__ (revalidator=revalidator@entry=0x558120f6df00, purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3027
#6  0x000055811c873516 in revalidator_sweep (revalidator=0x558120f6df00) at ofproto/ofproto-dpif-upcall.c:3085
#7  udpif_revalidator (arg=0x558120f6df00) at ofproto/ofproto-dpif-upcall.c:1093
#8  0x000055811c94863f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:422
#9  0x00007f7bf4321e65 in start_thread (arg=0x7f7bdffff700) at pthread_create.c:307
#10 0x00007f7bf229288d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
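
For anyone reading frame #4 without the source at hand, here is a rough sketch of the kind of check that produces the "passed uninitialized ovs_mutex" abort, as described in the original patch message (a NULL `where` means the mutex was never initialized or was already destroyed). This is not the OVS code: `struct sketch_mutex` and every `sketch_*` function are hypothetical stand-ins, the lock itself is modeled by a plain flag, and the sketch returns -1 where the real code calls ovs_abort().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct ovs_mutex: a lock flag plus a "where"
 * string that is NULL while the mutex is uninitialized or destroyed. */
struct sketch_mutex {
    bool locked;
    const char *where;
};

void sketch_mutex_init(struct sketch_mutex *m)
{
    m->locked = false;
    m->where = "<unlocked>";
}

void sketch_mutex_unlock(struct sketch_mutex *m)
{
    m->locked = false;
    m->where = "<unlocked>";
}

/* Destroying the mutex clears "where", so later users can detect it. */
void sketch_mutex_destroy(struct sketch_mutex *m)
{
    m->locked = false;
    m->where = NULL;
}

/* Returns 0 on success, EBUSY if already locked, and -1 in the case
 * where the real code would abort the process. */
int sketch_mutex_trylock(struct sketch_mutex *m, const char *where)
{
    if (!m->where) {
        fprintf(stderr, "%s: passed uninitialized mutex\n", where);
        return -1;
    }
    if (m->locked) {
        return EBUSY;
    }
    m->locked = true;
    m->where = where;
    return 0;
}

/* Returns 1 if a live mutex locks fine and a destroyed one is rejected. */
int sketch_mutex_demo(void)
{
    struct sketch_mutex m;
    int before, after;

    sketch_mutex_init(&m);
    before = sketch_mutex_trylock(&m, "demo:1");
    sketch_mutex_unlock(&m);

    sketch_mutex_destroy(&m);
    after = sketch_mutex_trylock(&m, "demo:2");

    return before == 0 && after == -1;
}
```

The point of the sketch is only that the abort is a symptom: the revalidator reached a ukey whose mutex had already gone through the destroy step.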


------------------ Original ------------------
From: "LIU Yulong" <[email protected]>;
Date: Wed, Oct 22, 2025 01:54 PM
To: "Eelco Chaudron" <[email protected]>;
Cc: "dev" <[email protected]>;
Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add ovsrcu_postpone for ukey_delete__.


Updates:
1. The change to `upcall_uninit` did not solve the issue; ovs-vswitchd still
dumps core from the call `ukey_delete(umap, ukey);` in `revalidator_sweep__`.
2. The change to `upcall_receive` was applied to another host, and we have not
seen a core dump for 24 hours. We need to run it for a longer period to verify.




------------------ Original ------------------
From: "Eelco Chaudron" <[email protected]>;
Date: Mon, Oct 20, 2025 07:04 PM
To: "LIU Yulong" <[email protected]>;
Cc: "dev" <[email protected]>;
Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add ovsrcu_postpone for ukey_delete__.


On 20 Oct 2025, at 12:35, LIU Yulong wrote:

> Thank you Eelco.
>
>
> Code search shows we have `recv_upcalls` and `upcall_cb`, which will call `upcall_uninit`.
> And dp_netdev_upcall will call the dp->upcall_cb.
> So we have call stacks like this:
> i) handle_packet_upcall->dp_netdev_upcall->upcall_cb->upcall_uninit
> ii) dp_execute_userspace_action->dp_netdev_upcall->upcall_cb->upcall_uninit
>
> Could you confirm these calls?

Off the top of my head, this is correct. However, the new ukey structure is
never inserted, so we do not need the RCU-delayed remove.
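
To make that distinction concrete, below is a single-threaded sketch of the two release paths: an object that was never published can be destroyed right away, while one that readers may still see has to be unlinked first and destroyed only after a grace period. Everything here is hypothetical and only models the idea: `struct sketch_ukey`, the `sketch_*` functions, and the fixed-size deferred queue are stand-ins, not cmap_remove() or ovsrcu_postpone() themselves, and "destroy" just sets a flag so the result can be inspected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a ukey. */
struct sketch_ukey {
    bool inserted;   /* Published in the shared map, visible to readers? */
    bool destroyed;  /* Set instead of freeing, so the demo can check it. */
};

#define SKETCH_MAX_DEFERRED 8
static struct sketch_ukey *sketch_deferred[SKETCH_MAX_DEFERRED];
static size_t sketch_n_deferred;

static void sketch_destroy(struct sketch_ukey *u)
{
    u->destroyed = true;
}

/* Stand-in for an RCU postpone: remember the object for later. */
static bool sketch_postpone(struct sketch_ukey *u)
{
    if (sketch_n_deferred >= SKETCH_MAX_DEFERRED) {
        return false;
    }
    sketch_deferred[sketch_n_deferred++] = u;
    return true;
}

/* Stand-in for the end of a grace period: no reader can hold a
 * reference any more, so the deferred objects can be destroyed. */
void sketch_grace_period_end(void)
{
    for (size_t i = 0; i < sketch_n_deferred; i++) {
        sketch_destroy(sketch_deferred[i]);
    }
    sketch_n_deferred = 0;
}

void sketch_release(struct sketch_ukey *u)
{
    if (u->inserted) {
        u->inserted = false;   /* Unlink first (models the map removal). */
        sketch_postpone(u);    /* Destroy later, after the grace period. */
    } else {
        sketch_destroy(u);     /* Never visible to readers: safe now. */
    }
}

/* Returns 7 when all three expectations hold. */
int sketch_release_demo(void)
{
    struct sketch_ukey fresh = { false, false };
    struct sketch_ukey shared = { true, false };
    int score = 0;

    sketch_release(&fresh);
    score += fresh.destroyed ? 1 : 0;    /* Destroyed immediately. */

    sketch_release(&shared);
    score += !shared.destroyed ? 2 : 0;  /* Still intact for readers. */

    sketch_grace_period_end();
    score += shared.destroyed ? 4 : 0;   /* Destroyed after grace period. */

    return score;
}
```

In these terms, the crash would require an object on the "inserted" side to be destroyed through the immediate path, which is why it matters whether the ukey reaching upcall_uninit() was ever inserted.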

> For your change, I'll run tests with recorded packets to verify.

Thanks, and let me know the results.

//Eelco

>
> Regards,
>
>
> LIU Yulong
>
>
> ------------------ Original ------------------
> From: "Eelco Chaudron" <[email protected]>;
> Date: Fri, Oct 17, 2025 08:08 PM
> To: "LIU Yulong" <[email protected]>;
> Cc: "dev" <[email protected]>;
> Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add ovsrcu_postpone for ukey_delete__.
>
>
> Hi Liu,
>
> I looked at the change; however, upcall_uninit() is only called for newly created (never inserted) ukeys, so the ovsrcu_postpone() call is not needed.
>
> However, I did find an issue in upcall_receive(), where, in an error path, it could use an uninitialized upcall structure, causing a ukey to be freed that should not have been.
>
> Can you try out the diff below to see if it fixes your problem?
>
> Cheers,
>
> Eelco
>
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index b3b4b2d2f..53b906a16 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -1230,6 +1230,17 @@ upcall_receive(struct upcall *upcall, const struct dpif_backer *backer,
>  {
>      int error;
>
> +    /* Initialize the minimal required fields in the upcall structure to ensure
> +     * upcall_uninit() does not operate on invalid data. */
> +    upcall->have_recirc_ref = false;
> +    upcall->xout_initialized = false;
> +    upcall->ukey_persists = false;
> +    upcall->ukey = NULL;
> +    ofpbuf_use_stub(&upcall->odp_actions, upcall->odp_actions_stub,
> +                    sizeof upcall->odp_actions_stub);
> +    ofpbuf_init(&upcall->put_actions, 0);
> +
> +
>      upcall->type = classify_upcall(type, userdata, &upcall->cookie);
>      if (upcall->type == BAD_UPCALL) {
>          return EAGAIN;
> @@ -1258,19 +1269,11 @@ upcall_receive(struct upcall *upcall, const struct dpif_backer *backer,
>      }
>
>      upcall->recirc = NULL;
> -    upcall->have_recirc_ref = false;
>      upcall->flow = flow;
>      upcall->packet = packet;
>      upcall->ufid = ufid;
>      upcall->pmd_id = pmd_id;
> -    ofpbuf_use_stub(&upcall->odp_actions, upcall->odp_actions_stub,
> -                    sizeof upcall->odp_actions_stub);
> -    ofpbuf_init(&upcall->put_actions, 0);
>
> -    upcall->xout_initialized = false;
> -    upcall->ukey_persists = false;
> -
> -    upcall->ukey = NULL;
>      upcall->key = NULL;
>      upcall->key_len = 0;
>      upcall->mru = mru;
>
>
> On 16 Oct 2025, at 3:12, LIU Yulong wrote:
>
> > We have the following call stack from a core dump:
> > *0  0x00007f7f197ae337 in raise () from /lib64/libc.so.6
> > *1  0x00007f7f197afa28 in abort () from /lib64/libc.so.6
> > *2  0x000055934ca4f4ee in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7f7f07530360) at lib/util.c:499
> > *3  0x000055934ca4f584 in ovs_abort (err_no=err_no@entry=0, format=format@entry=0x55934ccedd18 "%s: %s() passed uninitialized ovs_mutex") at lib/util.c:491
> > *4  0x000055934ca1a4a1 in ovs_mutex_trylock_at (l_=l_@entry=0x7f7ed4a43e58, where=where@entry=0x55934cccb318 "ofproto/ofproto-dpif-upcall.c:3014") at lib/ovs-thread.c:106
> > *5  0x000055934c943181 in revalidator_sweep__ (revalidator=revalidator@entry=0x5593518c1720, purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3014
> > *6  0x000055934c9471a6 in revalidator_sweep (revalidator=0x5593518c1720) at ofproto/ofproto-dpif-upcall.c:3072
> > *7  udpif_revalidator (arg=0x5593518c1720) at ofproto/ofproto-dpif-upcall.c:1086
> > *8  0x000055934ca1b05f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:422
> > *9  0x00007f7f1b6ece65 in start_thread () from /lib64/libpthread.so.0
> > *10 0x00007f7f1987688d in clone () from /lib64/libc.so.6
> >
> > When calling ovs_mutex_trylock() on ukey->mutex, ovs_mutex_trylock_at
> > sees that the input is an "uninitialized ovs_mutex" (l->where is NULL),
> > and aborts.
> >
> > This state can only occur if the mutex has never been initialized or
> > has been destroyed. The mutex is definitely initialized in ukey_create__,
> > so the "uninitialized" state almost certainly means "destroyed".
> > Destruction occurs in ukey_delete__, which calls ovs_mutex_destroy(&ukey->mutex)
> > and sets where to NULL.
> >
> > When revalidator_sweep__ is traversing the cmap and trying to lock (&ukey->mutex),
> > it encounters a ukey that has been directly destroyed by ukey_delete__.
> > However, the ukey is still visible to the revalidator (either still in
> > the cmap or has not yet passed the RCU grace period), resulting in an
> > abort. That is to say, there is a path where ukey_delete__ is directly
> > called during concurrent traversal of ukeys, bypassing the
> > cmap_remove + ovsrcu_postpone semantics of ukey_delete.
> >
> > Modify upcall_uninit() to change the direct ukey_delete__ call to an RCU-deferred
> > release to avoid concurrent traversal conflicts with the revalidator.
> > This ensures that ukey_delete__ is not executed until after the
> > global grace period, and that the CMAP_FOR_EACH within
> > revalidator_sweep__ will not encounter a destroyed mutex before
> > the end of the running cycle.
> >
> > Some earlier email discussions:
> > [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
> > [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
> > [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
> >
> > Signed-off-by: LIU Yulong <[email protected]>
> > ---
> >  ofproto/ofproto-dpif-upcall.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> > index 9dfa52d82..b3b4b2d2f 100644
> > --- a/ofproto/ofproto-dpif-upcall.c
> > +++ b/ofproto/ofproto-dpif-upcall.c
> > @@ -1386,7 +1386,7 @@ upcall_uninit(struct upcall *upcall)
> >          ofpbuf_uninit(&upcall->put_actions);
> >          if (upcall->ukey) {
> >              if (!upcall->ukey_persists) {
> > -                ukey_delete__(upcall->ukey);
> > +                ovsrcu_postpone(ukey_delete__, upcall->ukey);
> >              }
> >          } else if (upcall->have_recirc_ref) {
> >              /* The reference was transferred to the ukey if one was created. */
> > --
> > 2.50.1 (Apple Git-155)
> >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev