On 03/07/2024 13:01, Eelco Chaudron wrote: > > > On 3 Jul 2024, at 9:31, Roi Dayan wrote: > >> On 18/06/2024 11:53, Eelco Chaudron wrote: >>> >>> >>> On 18 Jun 2024, at 8:05, Roi Dayan wrote: >>> >>>> On 03/06/2024 16:29, Eelco Chaudron wrote: >>>>> >>>>> >>>>> On 3 Jun 2024, at 10:07, Roi Dayan wrote: >>>>> >>>>>> On 03/06/2024 10:18, Roi Dayan wrote: >>>>>>> >>>>>>> >>>>>>> On 30/05/2024 18:48, Eelco Chaudron wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote: >>>>>>>> >>>>>>>>> It is observed in some environments that there are much more ukeys >>>>>>>>> than >>>>>>>>> actual DP flows. For example: >>>>>>>>> >>>>>>>>> $ ovs-appctl upcall/show >>>>>>>>> system@ovs-system: >>>>>>>>> flows : (current 7) (avg 6) (max 117) (limit 2125) >>>>>>>>> offloaded flows : 525 >>>>>>>>> dump duration : 1063ms >>>>>>>>> ufid enabled : true >>>>>>>>> >>>>>>>>> 23: (keys 3612) >>>>>>>>> 24: (keys 3625) >>>>>>>>> 25: (keys 3485) >>>>>>>>> >>>>>>>>> The revalidator threads are busy revalidating the stale ukeys leading >>>>>>>>> to >>>>>>>>> high CPU and long dump duration. >>>>>>>>> >>>>>>>>> There are some possible situations that may result in stale ukeys that >>>>>>>>> have no corresponding DP flows. >>>>>>>>> >>>>>>>>> In revalidator, push_dp_ops() doesn't check error if the op type is >>>>>>>>> not >>>>>>>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc >>>>>>>>> offload >>>>>>>>> case, where the old flow is deleted first and then the new one is >>>>>>>>> created. If the creation fails, the ukey will be stale (no >>>>>>>>> corresponding >>>>>>>>> DP flow). This patch adds a warning in such case. >>>>>>>> >>>>>>>> Not sure I understand, this behavior should be captured by the >>>>>>>> UKEY_INCONSISTENT state. >>>>>>> >>>>>>> Hi Eelco, >>>>>>> >>>>>>> thanks for reviewing. >>>>>>> We started the patch on older branch as we didn't rebase for some time >>>>>>> and got a little later on sending it. >>>>>>> I see the addition now for UKEY_INCOSISTENT in the following patch: >>>>>>> >>>>>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors. >>>>>>> >>>>>>> Looks like it handles the same situation we tried to handle in this >>>>>>> patch. >>>>>>> >>>>>>>> >>>>>>>>> Another possible scenario is in handle_upcalls() if a PUT operation >>>>>>>>> did >>>>>>>>> not succeed and op->error attribute was not set correctly it can lead >>>>>>>>> to >>>>>>>>> stale ukey in operational state. >>>>>>>> >>>>>>>> >>>>>>>> Guess we need to figure out these cases, as these are the root cause >>>>>>>> of your problem. >>>>>>> >>>>>>> right. maybe. but this could help keep the system alive/clean while >>>>>>> logging the >>>>>>> bad situation instead of having increase in those stale/inconsistent >>>>>>> ukeys. >>>>>>> I understand if it's not nice to handle it like this. >>>>>>> >>>>>> >>>>>> Hi Eelco, >>>>>> >>>>>> I remember now one of the reproduction scenarios we did is do some >>>>>> traffic >>>>>> to get rules added using TC and then delete those from tc command line >>>>>> and it got to stale ukeys. >>>>>> The revalidator dump thread not seeing the rules so not updating anything >>>>>> and sweep over the ukeys skipping them as well. >>>>>> This is why we checked against the timing stats of the ukey. >>>>>> >>>>>> I remember I tested on the upstream code and not our development branch >>>>>> and it reproduces. I didn't notice if the commit adding UKEY_INCONSISTENT >>>>>> existed but it handles errors from adding flows so I dont think it >>>>>> matters. >>>>>> >>>>>> I'll try to check and verify again but I think it's still here. >>>>>> So while cases getting dop.error handled with UKEY_INCONSISTENT, >>>>>> this case I think is not. >>>>> >>>>> I think you are right this case is not covered by the UKEY_INCONSISTENT >>>>> test below. See my suggestion on fixing this in revalidate_ukey(). >>>>> >>>>> Maybe you could also add a test case for this in the offload test suite. >>>>> >>>>> //Eelco >>>> >>>> Hi Eelco, >>>> >>>> Thanks for the review. I didn't have time to check this but wanted to >>>> reply that it's in my queue to verify your suggestion and add a test. >>> >>> Thanks for letting me know, and I understand as we are all busy :) >>> >>> //Eelco >>> >> >> >> Hi Eelco, >> >> I tested your suggestion to move to against removing tc rules >> from tc command I see it's helpful in cleaning those stale ukeys. >> I have trouble doing a clean test for this for the testsuite. > > With the tc rule deletion, it becomes an offload-specific test case, but as > this is the only way we’ve seen the issue, it might be fine to add it just > there. >
yes. after the first half of the fix you pointed out was implemented already. now the way i reproduce it is deleting tc rules. could potentially happen from something else or because of another bug somewhere. >> At first I tested with modification in revalidator_sweep__() to >> always set seq_mismatch to reach revalidate_ukey(). >> Later I moved to create a redundant veth and doing add/del loop >> of it to the bridge to cause a seq mismatch. - sure this is >> the clean way but without the change I get stale ukeys and with >> it we get to the cleanup change and cleaning the ukeys. > > Can you maybe share the test case so I know what you are doing? Is this > working for the general system tests? I did this: - create bridge with 2 veth ports. configure ips. - ping between the ports to have tc rules. - repeated few times: clear the tc rules like this: tc filter del dev veth1 ingress and also on the 2nd port. - set max-idle to 1 and remove it to cause a flush of the rules. - create another set of veth ports. add/del veth4 from the bridge. to cause a sweep. - before the fix: ovs-appctl upcall/show will show ukeys. - after the fix upcall/show will show 0 ukeys. > >> I'll send v2 for now as a clean patch with only the relevant change >> and cleaner commit msg and lets take it from there if my suggestion >> here of the test is ok or something else or the test can be postponed >> to a later time to think of a cleaner/right way to do it. > > You missed the change in the python script as explained in the definition > header. I will reply to the patch. oh right. i'll check it. thanks. > > //Eelco > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
