Hi,

Maybe this fix should be the root cause of:
ovs-vswitchd core at revalidator_sweep__
https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html


Did you see such core at revalidator_sweep__ ?


Regards,


LIU Yulong


 
 
------------------ Original ------------------
From: &nbsp;"Eelco&nbsp;Chaudron"<[email protected]&gt;;
Date: &nbsp;Wed, Jul 3, 2024 06:46 PM
To: &nbsp;"Roi Dayan"<[email protected]&gt;; 
Cc: &nbsp;"dev"<[email protected]&gt;; "Maor Dickman"<[email protected]&gt;; 
Subject: &nbsp;Re: [ovs-dev] [PATCH 1/1] ofproto-dpif-upcall: Avoid stale ukeys 
leaks.

&nbsp;



On 3 Jul 2024, at 12:22, Roi Dayan wrote:

&gt; On 03/07/2024 13:01, Eelco Chaudron wrote:
&gt;&gt;
&gt;&gt;
&gt;&gt; On 3 Jul 2024, at 9:31, Roi Dayan wrote:
&gt;&gt;
&gt;&gt;&gt; On 18/06/2024 11:53, Eelco Chaudron wrote:
&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt; On 18 Jun 2024, at 8:05, Roi Dayan wrote:
&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt; On 03/06/2024 16:29, Eelco Chaudron wrote:
&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt; On 3 Jun 2024, at 10:07, Roi Dayan wrote:
&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt; On 03/06/2024 10:18, Roi Dayan wrote:
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On 30/05/2024 18:48, Eelco Chaudron wrote:
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On 23 May 2024, at 12:46, Roi Dayan via 
dev wrote:
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; It is observed in some environments 
that there are much more ukeys than
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; actual DP flows. For example:
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; $ ovs-appctl upcall/show
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; system@ovs-system:
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; flows : (current 7) (avg 6) (max 117) 
(limit 2125)
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; offloaded flows : 525
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; dump duration : 1063ms
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; ufid enabled : true
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 23: (keys 3612)
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 24: (keys 3625)
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 25: (keys 3485)
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; The revalidator threads are busy 
revalidating the stale ukeys leading to
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; high CPU and long dump duration.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; There are some possible situations 
that may result in stale ukeys that
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; have no corresponding DP flows.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; In revalidator, push_dp_ops() doesn't 
check error if the op type is not
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; DEL. It is possible that a PUT(MODIFY) 
fails, especially for tc offload
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; case, where the old flow is deleted 
first and then the new one is
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; created. If the creation fails, the 
ukey will be stale (no corresponding
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; DP flow). This patch adds a warning in 
such case.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Not sure I understand, this behavior 
should be captured by the UKEY_INCONSISTENT state.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Hi Eelco,
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; thanks for reviewing.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; We started the patch on older branch as we 
didn't rebase for some time
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; and got a little later on sending it.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; I see the addition now for UKEY_INCOSISTENT in 
the following patch:
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; cf11766cbcf1 ofproto-dpif-upcall: Fix 
push_dp_ops to handle all errors.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Looks like it handles the same situation we 
tried to handle in this patch.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Another possible scenario is in 
handle_upcalls() if a PUT operation did
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; not succeed and op-&gt;error attribute 
was not set correctly it can lead to
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; stale ukey in operational state.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Guess we need to figure out these cases, 
as these are the root cause of your problem.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; right. maybe. but this could help keep the 
system alive/clean while logging the
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; bad situation instead of having increase in 
those stale/inconsistent ukeys.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; I understand if it's not nice to handle it 
like this.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt; Hi Eelco,
&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt; I remember now one of the reproduction scenarios 
we did is do some traffic
&gt;&gt;&gt;&gt;&gt;&gt;&gt; to get rules added using TC and then delete those 
from tc command line
&gt;&gt;&gt;&gt;&gt;&gt;&gt; and it got to stale ukeys.
&gt;&gt;&gt;&gt;&gt;&gt;&gt; The revalidator dump thread not seeing the rules 
so not updating anything
&gt;&gt;&gt;&gt;&gt;&gt;&gt; and sweep over the ukeys skipping them as well.
&gt;&gt;&gt;&gt;&gt;&gt;&gt; This is why we checked against the timing stats of 
the ukey.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt; I remember I tested on the upstream code and not 
our development branch
&gt;&gt;&gt;&gt;&gt;&gt;&gt; and it reproduces. I didn't notice if the commit 
adding UKEY_INCONSISTENT
&gt;&gt;&gt;&gt;&gt;&gt;&gt; existed but it handles errors from adding flows so 
I dont think it matters.
&gt;&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt;&gt; I'll try to check and verify again but I think 
it's still here.
&gt;&gt;&gt;&gt;&gt;&gt;&gt; So while cases getting dop.error handled with 
UKEY_INCONSISTENT,
&gt;&gt;&gt;&gt;&gt;&gt;&gt; this case I think is not.
&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt; I think you are right this case is not covered by the 
UKEY_INCONSISTENT test below. See my suggestion on fixing this in 
revalidate_ukey().
&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt; Maybe you could also add a test case for this in the 
offload test suite.
&gt;&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt;&gt; //Eelco
&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt; Hi Eelco,
&gt;&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt;&gt; Thanks for the review. I didn't have time to check this 
but wanted to
&gt;&gt;&gt;&gt;&gt; reply that it's in my queue to verify your suggestion and 
add a test.
&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt; Thanks for letting me know, and I understand as we are all 
busy :)
&gt;&gt;&gt;&gt;
&gt;&gt;&gt;&gt; //Eelco
&gt;&gt;&gt;&gt;
&gt;&gt;&gt;
&gt;&gt;&gt;
&gt;&gt;&gt; Hi Eelco,
&gt;&gt;&gt;
&gt;&gt;&gt; I tested your suggestion to move to against removing tc rules
&gt;&gt;&gt; from tc command I see it's helpful in cleaning those stale ukeys.
&gt;&gt;&gt; I have trouble doing a clean test for this for the testsuite.
&gt;&gt;
&gt;&gt; With the tc rule deletion, it becomes an offload-specific test case, 
but as this is the only way we’ve seen the issue, it might be fine to add it 
just there.
&gt;&gt;
&gt;
&gt; yes. after the first half of the fix you pointed out was implemented 
already.
&gt; now the way i reproduce it is deleting tc rules.
&gt; could potentially happen from something else or because of another bug 
somewhere.
&gt;
&gt;&gt;&gt; At first I tested with modification in revalidator_sweep__() to
&gt;&gt;&gt; always set seq_mismatch to reach revalidate_ukey().
&gt;&gt;&gt; Later I moved to create a redundant veth and doing add/del loop
&gt;&gt;&gt; of it to the bridge to cause a seq mismatch. - sure this is
&gt;&gt;&gt; the clean way but without the change I get stale ukeys and with
&gt;&gt;&gt; it we get to the cleanup change and cleaning the ukeys.
&gt;&gt;
&gt;&gt; Can you maybe share the test case so I know what you are doing? Is 
this working for the general system tests?
&gt;
&gt; I did this:
&gt;
&gt; - create bridge with 2 veth ports. configure ips.
&gt; - ping between the ports to have tc rules.
&gt; - repeated few times: clear the tc rules like this: tc filter del dev 
veth1 ingress and also on the 2nd port.
&gt; - set max-idle to 1 and remove it to cause a flush of the rules.
&gt; - create another set of veth ports. add/del veth4 from the bridge. to 
cause a sweep.
&gt; - before the fix: ovs-appctl upcall/show will show ukeys.
&gt; - after the fix upcall/show will show 0 ukeys.

Thanks, will find some time to test this, and reply to the V2.

//Eelco

&gt;&gt;&gt; I'll send v2 for now as a clean patch with only the relevant change
&gt;&gt;&gt; and cleaner commit msg and lets take it from there if my suggestion
&gt;&gt;&gt; here of the test is ok or something else or the test can be 
postponed
&gt;&gt;&gt; to a later time to think of a cleaner/right way to do it.
&gt;&gt;
&gt;&gt; You missed the change in the python script as explained in the 
definition header. I will reply to the patch.
&gt;
&gt; oh right. i'll check it. thanks.
&gt;
&gt;&gt;
&gt;&gt; //Eelco
&gt;&gt;

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to