On 3 Jul 2024, at 9:31, Roi Dayan wrote:
> On 18/06/2024 11:53, Eelco Chaudron wrote: >> >> >> On 18 Jun 2024, at 8:05, Roi Dayan wrote: >> >>> On 03/06/2024 16:29, Eelco Chaudron wrote: >>>> >>>> >>>> On 3 Jun 2024, at 10:07, Roi Dayan wrote: >>>> >>>>> On 03/06/2024 10:18, Roi Dayan wrote: >>>>>> >>>>>> >>>>>> On 30/05/2024 18:48, Eelco Chaudron wrote: >>>>>>> >>>>>>> >>>>>>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote: >>>>>>> >>>>>>>> It is observed in some environments that there are much more ukeys than >>>>>>>> actual DP flows. For example: >>>>>>>> >>>>>>>> $ ovs-appctl upcall/show >>>>>>>> system@ovs-system: >>>>>>>> flows : (current 7) (avg 6) (max 117) (limit 2125) >>>>>>>> offloaded flows : 525 >>>>>>>> dump duration : 1063ms >>>>>>>> ufid enabled : true >>>>>>>> >>>>>>>> 23: (keys 3612) >>>>>>>> 24: (keys 3625) >>>>>>>> 25: (keys 3485) >>>>>>>> >>>>>>>> The revalidator threads are busy revalidating the stale ukeys leading >>>>>>>> to >>>>>>>> high CPU and long dump duration. >>>>>>>> >>>>>>>> There are some possible situations that may result in stale ukeys that >>>>>>>> have no corresponding DP flows. >>>>>>>> >>>>>>>> In revalidator, push_dp_ops() doesn't check error if the op type is not >>>>>>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc offload >>>>>>>> case, where the old flow is deleted first and then the new one is >>>>>>>> created. If the creation fails, the ukey will be stale (no >>>>>>>> corresponding >>>>>>>> DP flow). This patch adds a warning in such case. >>>>>>> >>>>>>> Not sure I understand, this behavior should be captured by the >>>>>>> UKEY_INCONSISTENT state. >>>>>> >>>>>> Hi Eelco, >>>>>> >>>>>> thanks for reviewing. >>>>>> We started the patch on older branch as we didn't rebase for some time >>>>>> and got a little later on sending it. >>>>>> I see the addition now for UKEY_INCOSISTENT in the following patch: >>>>>> >>>>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors. >>>>>> >>>>>> Looks like it handles the same situation we tried to handle in this >>>>>> patch. >>>>>> >>>>>>> >>>>>>>> Another possible scenario is in handle_upcalls() if a PUT operation did >>>>>>>> not succeed and op->error attribute was not set correctly it can lead >>>>>>>> to >>>>>>>> stale ukey in operational state. >>>>>>> >>>>>>> >>>>>>> Guess we need to figure out these cases, as these are the root cause of >>>>>>> your problem. >>>>>> >>>>>> right. maybe. but this could help keep the system alive/clean while >>>>>> logging the >>>>>> bad situation instead of having increase in those stale/inconsistent >>>>>> ukeys. >>>>>> I understand if it's not nice to handle it like this. >>>>>> >>>>> >>>>> Hi Eelco, >>>>> >>>>> I remember now one of the reproduction scenarios we did is do some traffic >>>>> to get rules added using TC and then delete those from tc command line >>>>> and it got to stale ukeys. >>>>> The revalidator dump thread not seeing the rules so not updating anything >>>>> and sweep over the ukeys skipping them as well. >>>>> This is why we checked against the timing stats of the ukey. >>>>> >>>>> I remember I tested on the upstream code and not our development branch >>>>> and it reproduces. I didn't notice if the commit adding UKEY_INCONSISTENT >>>>> existed but it handles errors from adding flows so I dont think it >>>>> matters. >>>>> >>>>> I'll try to check and verify again but I think it's still here. >>>>> So while cases getting dop.error handled with UKEY_INCONSISTENT, >>>>> this case I think is not. >>>> >>>> I think you are right this case is not covered by the UKEY_INCONSISTENT >>>> test below. See my suggestion on fixing this in revalidate_ukey(). >>>> >>>> Maybe you could also add a test case for this in the offload test suite. >>>> >>>> //Eelco >>> >>> Hi Eelco, >>> >>> Thanks for the review. I didn't have time to check this but wanted to >>> reply that it's in my queue to verify your suggestion and add a test. >> >> Thanks for letting me know, and I understand as we are all busy :) >> >> //Eelco >> > > > Hi Eelco, > > I tested your suggestion to move to against removing tc rules > from tc command I see it's helpful in cleaning those stale ukeys. > I have trouble doing a clean test for this for the testsuite. With the tc rule deletion, it becomes an offload-specific test case, but as this is the only way we’ve seen the issue, it might be fine to add it just there. > At first I tested with modification in revalidator_sweep__() to > always set seq_mismatch to reach revalidate_ukey(). > Later I moved to create a redundant veth and doing add/del loop > of it to the bridge to cause a seq mismatch. - sure this is > the clean way but without the change I get stale ukeys and with > it we get to the cleanup change and cleaning the ukeys. Can you maybe share the test case so I know what you are doing? Is this working for the general system tests? > I'll send v2 for now as a clean patch with only the relevant change > and cleaner commit msg and lets take it from there if my suggestion > here of the test is ok or something else or the test can be postponed > to a later time to think of a cleaner/right way to do it. You missed the change in the python script as explained in the definition header. I will reply to the patch. //Eelco _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
