Hi,
Maybe this fix should be the root cause of: ovs-vswitchd core at revalidator_sweep__ https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html Did you see such core at revalidator_sweep__ ? Regards, LIU Yulong ------------------ Original ------------------ From: "Eelco Chaudron"<[email protected]>; Date: Wed, Jul 3, 2024 06:46 PM To: "Roi Dayan"<[email protected]>; Cc: "dev"<[email protected]>; "Maor Dickman"<[email protected]>; Subject: Re: [ovs-dev] [PATCH 1/1] ofproto-dpif-upcall: Avoid stale ukeys leaks. On 3 Jul 2024, at 12:22, Roi Dayan wrote: > On 03/07/2024 13:01, Eelco Chaudron wrote: >> >> >> On 3 Jul 2024, at 9:31, Roi Dayan wrote: >> >>> On 18/06/2024 11:53, Eelco Chaudron wrote: >>>> >>>> >>>> On 18 Jun 2024, at 8:05, Roi Dayan wrote: >>>> >>>>> On 03/06/2024 16:29, Eelco Chaudron wrote: >>>>>> >>>>>> >>>>>> On 3 Jun 2024, at 10:07, Roi Dayan wrote: >>>>>> >>>>>>> On 03/06/2024 10:18, Roi Dayan wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 30/05/2024 18:48, Eelco Chaudron wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote: >>>>>>>>> >>>>>>>>>> It is observed in some environments that there are much more ukeys than >>>>>>>>>> actual DP flows. For example: >>>>>>>>>> >>>>>>>>>> $ ovs-appctl upcall/show >>>>>>>>>> system@ovs-system: >>>>>>>>>> flows : (current 7) (avg 6) (max 117) (limit 2125) >>>>>>>>>> offloaded flows : 525 >>>>>>>>>> dump duration : 1063ms >>>>>>>>>> ufid enabled : true >>>>>>>>>> >>>>>>>>>> 23: (keys 3612) >>>>>>>>>> 24: (keys 3625) >>>>>>>>>> 25: (keys 3485) >>>>>>>>>> >>>>>>>>>> The revalidator threads are busy revalidating the stale ukeys leading to >>>>>>>>>> high CPU and long dump duration. >>>>>>>>>> >>>>>>>>>> There are some possible situations that may result in stale ukeys that >>>>>>>>>> have no corresponding DP flows. >>>>>>>>>> >>>>>>>>>> In revalidator, push_dp_ops() doesn't check error if the op type is not >>>>>>>>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc offload >>>>>>>>>> case, where the old flow is deleted first and then the new one is >>>>>>>>>> created. If the creation fails, the ukey will be stale (no corresponding >>>>>>>>>> DP flow). This patch adds a warning in such case. >>>>>>>>> >>>>>>>>> Not sure I understand, this behavior should be captured by the UKEY_INCONSISTENT state. >>>>>>>> >>>>>>>> Hi Eelco, >>>>>>>> >>>>>>>> thanks for reviewing. >>>>>>>> We started the patch on older branch as we didn't rebase for some time >>>>>>>> and got a little later on sending it. >>>>>>>> I see the addition now for UKEY_INCOSISTENT in the following patch: >>>>>>>> >>>>>>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors. >>>>>>>> >>>>>>>> Looks like it handles the same situation we tried to handle in this patch. >>>>>>>> >>>>>>>>> >>>>>>>>>> Another possible scenario is in handle_upcalls() if a PUT operation did >>>>>>>>>> not succeed and op->error attribute was not set correctly it can lead to >>>>>>>>>> stale ukey in operational state. >>>>>>>>> >>>>>>>>> >>>>>>>>> Guess we need to figure out these cases, as these are the root cause of your problem. >>>>>>>> >>>>>>>> right. maybe. but this could help keep the system alive/clean while logging the >>>>>>>> bad situation instead of having increase in those stale/inconsistent ukeys. >>>>>>>> I understand if it's not nice to handle it like this. >>>>>>>> >>>>>>> >>>>>>> Hi Eelco, >>>>>>> >>>>>>> I remember now one of the reproduction scenarios we did is do some traffic >>>>>>> to get rules added using TC and then delete those from tc command line >>>>>>> and it got to stale ukeys. >>>>>>> The revalidator dump thread not seeing the rules so not updating anything >>>>>>> and sweep over the ukeys skipping them as well. >>>>>>> This is why we checked against the timing stats of the ukey. >>>>>>> >>>>>>> I remember I tested on the upstream code and not our development branch >>>>>>> and it reproduces. I didn't notice if the commit adding UKEY_INCONSISTENT >>>>>>> existed but it handles errors from adding flows so I dont think it matters. >>>>>>> >>>>>>> I'll try to check and verify again but I think it's still here. >>>>>>> So while cases getting dop.error handled with UKEY_INCONSISTENT, >>>>>>> this case I think is not. >>>>>> >>>>>> I think you are right this case is not covered by the UKEY_INCONSISTENT test below. See my suggestion on fixing this in revalidate_ukey(). >>>>>> >>>>>> Maybe you could also add a test case for this in the offload test suite. >>>>>> >>>>>> //Eelco >>>>> >>>>> Hi Eelco, >>>>> >>>>> Thanks for the review. I didn't have time to check this but wanted to >>>>> reply that it's in my queue to verify your suggestion and add a test. >>>> >>>> Thanks for letting me know, and I understand as we are all busy :) >>>> >>>> //Eelco >>>> >>> >>> >>> Hi Eelco, >>> >>> I tested your suggestion to move to against removing tc rules >>> from tc command I see it's helpful in cleaning those stale ukeys. >>> I have trouble doing a clean test for this for the testsuite. >> >> With the tc rule deletion, it becomes an offload-specific test case, but as this is the only way we’ve seen the issue, it might be fine to add it just there. >> > > yes. after the first half of the fix you pointed out was implemented already. > now the way i reproduce it is deleting tc rules. > could potentially happen from something else or because of another bug somewhere. > >>> At first I tested with modification in revalidator_sweep__() to >>> always set seq_mismatch to reach revalidate_ukey(). >>> Later I moved to create a redundant veth and doing add/del loop >>> of it to the bridge to cause a seq mismatch. - sure this is >>> the clean way but without the change I get stale ukeys and with >>> it we get to the cleanup change and cleaning the ukeys. >> >> Can you maybe share the test case so I know what you are doing? Is this working for the general system tests? > > I did this: > > - create bridge with 2 veth ports. configure ips. > - ping between the ports to have tc rules. > - repeated few times: clear the tc rules like this: tc filter del dev veth1 ingress and also on the 2nd port. > - set max-idle to 1 and remove it to cause a flush of the rules. > - create another set of veth ports. add/del veth4 from the bridge. to cause a sweep. > - before the fix: ovs-appctl upcall/show will show ukeys. > - after the fix upcall/show will show 0 ukeys. Thanks, will find some time to test this, and reply to the V2. //Eelco >>> I'll send v2 for now as a clean patch with only the relevant change >>> and cleaner commit msg and lets take it from there if my suggestion >>> here of the test is ok or something else or the test can be postponed >>> to a later time to think of a cleaner/right way to do it. >> >> You missed the change in the python script as explained in the definition header. I will reply to the patch. > > oh right. i'll check it. thanks. > >> >> //Eelco >> _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
