On 3 Jul 2024, at 9:31, Roi Dayan wrote:

> On 18/06/2024 11:53, Eelco Chaudron wrote:
>>
>>
>> On 18 Jun 2024, at 8:05, Roi Dayan wrote:
>>
>>> On 03/06/2024 16:29, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 3 Jun 2024, at 10:07, Roi Dayan wrote:
>>>>
>>>>> On 03/06/2024 10:18, Roi Dayan wrote:
>>>>>>
>>>>>>
>>>>>> On 30/05/2024 18:48, Eelco Chaudron wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote:
>>>>>>>
>>>>>>>> It is observed in some environments that there are much more ukeys than
>>>>>>>> actual DP flows. For example:
>>>>>>>>
>>>>>>>> $ ovs-appctl upcall/show
>>>>>>>> system@ovs-system:
>>>>>>>> flows : (current 7) (avg 6) (max 117) (limit 2125)
>>>>>>>> offloaded flows : 525
>>>>>>>> dump duration : 1063ms
>>>>>>>> ufid enabled : true
>>>>>>>>
>>>>>>>> 23: (keys 3612)
>>>>>>>> 24: (keys 3625)
>>>>>>>> 25: (keys 3485)
>>>>>>>>
>>>>>>>> The revalidator threads are busy revalidating the stale ukeys leading 
>>>>>>>> to
>>>>>>>> high CPU and long dump duration.
>>>>>>>>
>>>>>>>> There are some possible situations that may result in stale ukeys that
>>>>>>>> have no corresponding DP flows.
>>>>>>>>
>>>>>>>> In revalidator, push_dp_ops() doesn't check error if the op type is not
>>>>>>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc offload
>>>>>>>> case, where the old flow is deleted first and then the new one is
>>>>>>>> created. If the creation fails, the ukey will be stale (no 
>>>>>>>> corresponding
>>>>>>>> DP flow). This patch adds a warning in such case.
>>>>>>>
>>>>>>> Not sure I understand, this behavior should be captured by the 
>>>>>>> UKEY_INCONSISTENT state.
>>>>>>
>>>>>> Hi Eelco,
>>>>>>
>>>>>> thanks for reviewing.
>>>>>> We started the patch on older branch as we didn't rebase for some time
>>>>>> and got a little later on sending it.
>>>>>> I see the addition now for UKEY_INCOSISTENT in the following patch:
>>>>>>
>>>>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors.
>>>>>>
>>>>>> Looks like it handles the same situation we tried to handle in this 
>>>>>> patch.
>>>>>>
>>>>>>>
>>>>>>>> Another possible scenario is in handle_upcalls() if a PUT operation did
>>>>>>>> not succeed and op->error attribute was not set correctly it can lead 
>>>>>>>> to
>>>>>>>> stale ukey in operational state.
>>>>>>>
>>>>>>>
>>>>>>> Guess we need to figure out these cases, as these are the root cause of 
>>>>>>> your problem.
>>>>>>
>>>>>> right. maybe. but this could help keep the system alive/clean while 
>>>>>> logging the
>>>>>> bad situation instead of having increase in those stale/inconsistent 
>>>>>> ukeys.
>>>>>> I understand if it's not nice to handle it like this.
>>>>>>
>>>>>
>>>>> Hi Eelco,
>>>>>
>>>>> I remember now one of the reproduction scenarios we did is do some traffic
>>>>> to get rules added using TC and then delete those from tc command line
>>>>> and it got to stale ukeys.
>>>>> The revalidator dump thread not seeing the rules so not updating anything
>>>>> and sweep over the ukeys skipping them as well.
>>>>> This is why we checked against the timing stats of the ukey.
>>>>>
>>>>> I remember I tested on the upstream code and not our development branch
>>>>> and it reproduces. I didn't notice if the commit adding UKEY_INCONSISTENT
>>>>> existed but it handles errors from adding flows so I dont think it 
>>>>> matters.
>>>>>
>>>>> I'll try to check and verify again but I think it's still here.
>>>>> So while cases getting dop.error handled with UKEY_INCONSISTENT,
>>>>> this case I think is not.
>>>>
>>>> I think you are right this case is not covered by the UKEY_INCONSISTENT 
>>>> test below. See my suggestion on fixing this in revalidate_ukey().
>>>>
>>>> Maybe you could also add a test case for this in the offload test suite.
>>>>
>>>> //Eelco
>>>
>>> Hi Eelco,
>>>
>>> Thanks for the review. I didn't have time to check this but wanted to
>>> reply that it's in my queue to verify your suggestion and add a test.
>>
>> Thanks for letting me know, and I understand as we are all busy :)
>>
>> //Eelco
>>
>
>
> Hi Eelco,
>
> I tested your suggestion to move to against removing tc rules
> from tc command I see it's helpful in cleaning those stale ukeys.
> I have trouble doing a clean test for this for the testsuite.

With the tc rule deletion, it becomes an offload-specific test case, but as 
this is the only way we’ve seen the issue, it might be fine to add it just 
there.

> At first I tested with modification in revalidator_sweep__() to
> always set seq_mismatch to reach revalidate_ukey().
> Later I moved to create a redundant veth and doing add/del loop
> of it to the bridge to cause a seq mismatch. - sure this is
> the clean way but without the change I get stale ukeys and with
> it we get to the cleanup change and cleaning the ukeys.

Can you maybe share the test case so I know what you are doing? Is this working 
for the general system tests?

> I'll send v2 for now as a clean patch with only the relevant change
> and cleaner commit msg and lets take it from there if my suggestion
> here of the test is ok or something else or the test can be postponed
> to a later time to think of a cleaner/right way to do it.

You missed the change in the python script as explained in the definition 
header. I will reply to the patch.

//Eelco

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to