On 03/07/2024 13:01, Eelco Chaudron wrote:
> 
> 
> On 3 Jul 2024, at 9:31, Roi Dayan wrote:
> 
>> On 18/06/2024 11:53, Eelco Chaudron wrote:
>>>
>>>
>>> On 18 Jun 2024, at 8:05, Roi Dayan wrote:
>>>
>>>> On 03/06/2024 16:29, Eelco Chaudron wrote:
>>>>>
>>>>>
>>>>> On 3 Jun 2024, at 10:07, Roi Dayan wrote:
>>>>>
>>>>>> On 03/06/2024 10:18, Roi Dayan wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 30/05/2024 18:48, Eelco Chaudron wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote:
>>>>>>>>
>>>>>>>>> It is observed in some environments that there are much more ukeys 
>>>>>>>>> than
>>>>>>>>> actual DP flows. For example:
>>>>>>>>>
>>>>>>>>> $ ovs-appctl upcall/show
>>>>>>>>> system@ovs-system:
>>>>>>>>> flows : (current 7) (avg 6) (max 117) (limit 2125)
>>>>>>>>> offloaded flows : 525
>>>>>>>>> dump duration : 1063ms
>>>>>>>>> ufid enabled : true
>>>>>>>>>
>>>>>>>>> 23: (keys 3612)
>>>>>>>>> 24: (keys 3625)
>>>>>>>>> 25: (keys 3485)
>>>>>>>>>
>>>>>>>>> The revalidator threads are busy revalidating the stale ukeys leading 
>>>>>>>>> to
>>>>>>>>> high CPU and long dump duration.
>>>>>>>>>
>>>>>>>>> There are some possible situations that may result in stale ukeys that
>>>>>>>>> have no corresponding DP flows.
>>>>>>>>>
>>>>>>>>> In revalidator, push_dp_ops() doesn't check error if the op type is 
>>>>>>>>> not
>>>>>>>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc 
>>>>>>>>> offload
>>>>>>>>> case, where the old flow is deleted first and then the new one is
>>>>>>>>> created. If the creation fails, the ukey will be stale (no 
>>>>>>>>> corresponding
>>>>>>>>> DP flow). This patch adds a warning in such case.
>>>>>>>>
>>>>>>>> Not sure I understand, this behavior should be captured by the 
>>>>>>>> UKEY_INCONSISTENT state.
>>>>>>>
>>>>>>> Hi Eelco,
>>>>>>>
>>>>>>> thanks for reviewing.
>>>>>>> We started the patch on older branch as we didn't rebase for some time
>>>>>>> and got a little later on sending it.
>>>>>>> I see the addition now for UKEY_INCOSISTENT in the following patch:
>>>>>>>
>>>>>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors.
>>>>>>>
>>>>>>> Looks like it handles the same situation we tried to handle in this 
>>>>>>> patch.
>>>>>>>
>>>>>>>>
>>>>>>>>> Another possible scenario is in handle_upcalls() if a PUT operation 
>>>>>>>>> did
>>>>>>>>> not succeed and op->error attribute was not set correctly it can lead 
>>>>>>>>> to
>>>>>>>>> stale ukey in operational state.
>>>>>>>>
>>>>>>>>
>>>>>>>> Guess we need to figure out these cases, as these are the root cause 
>>>>>>>> of your problem.
>>>>>>>
>>>>>>> right. maybe. but this could help keep the system alive/clean while 
>>>>>>> logging the
>>>>>>> bad situation instead of having increase in those stale/inconsistent 
>>>>>>> ukeys.
>>>>>>> I understand if it's not nice to handle it like this.
>>>>>>>
>>>>>>
>>>>>> Hi Eelco,
>>>>>>
>>>>>> I remember now one of the reproduction scenarios we did is do some 
>>>>>> traffic
>>>>>> to get rules added using TC and then delete those from tc command line
>>>>>> and it got to stale ukeys.
>>>>>> The revalidator dump thread not seeing the rules so not updating anything
>>>>>> and sweep over the ukeys skipping them as well.
>>>>>> This is why we checked against the timing stats of the ukey.
>>>>>>
>>>>>> I remember I tested on the upstream code and not our development branch
>>>>>> and it reproduces. I didn't notice if the commit adding UKEY_INCONSISTENT
>>>>>> existed but it handles errors from adding flows so I dont think it 
>>>>>> matters.
>>>>>>
>>>>>> I'll try to check and verify again but I think it's still here.
>>>>>> So while cases getting dop.error handled with UKEY_INCONSISTENT,
>>>>>> this case I think is not.
>>>>>
>>>>> I think you are right this case is not covered by the UKEY_INCONSISTENT 
>>>>> test below. See my suggestion on fixing this in revalidate_ukey().
>>>>>
>>>>> Maybe you could also add a test case for this in the offload test suite.
>>>>>
>>>>> //Eelco
>>>>
>>>> Hi Eelco,
>>>>
>>>> Thanks for the review. I didn't have time to check this but wanted to
>>>> reply that it's in my queue to verify your suggestion and add a test.
>>>
>>> Thanks for letting me know, and I understand as we are all busy :)
>>>
>>> //Eelco
>>>
>>
>>
>> Hi Eelco,
>>
>> I tested your suggestion to move to against removing tc rules
>> from tc command I see it's helpful in cleaning those stale ukeys.
>> I have trouble doing a clean test for this for the testsuite.
> 
> With the tc rule deletion, it becomes an offload-specific test case, but as 
> this is the only way we’ve seen the issue, it might be fine to add it just 
> there.
> 

yes. after the first half of the fix you pointed out was implemented already.
now the way i reproduce it is deleting tc rules.
could potentially happen from something else or because of another bug 
somewhere.

>> At first I tested with modification in revalidator_sweep__() to
>> always set seq_mismatch to reach revalidate_ukey().
>> Later I moved to create a redundant veth and doing add/del loop
>> of it to the bridge to cause a seq mismatch. - sure this is
>> the clean way but without the change I get stale ukeys and with
>> it we get to the cleanup change and cleaning the ukeys.
> 
> Can you maybe share the test case so I know what you are doing? Is this 
> working for the general system tests?

I did this:

- create bridge with 2 veth ports. configure ips.
- ping between the ports to have tc rules.
- repeated few times: clear the tc rules like this: tc filter del dev veth1 
ingress and also on the 2nd port.
- set max-idle to 1 and remove it to cause a flush of the rules.
- create another set of veth ports. add/del veth4 from the bridge. to cause a 
sweep.
- before the fix: ovs-appctl upcall/show will show ukeys.
- after the fix upcall/show will show 0 ukeys.

> 
>> I'll send v2 for now as a clean patch with only the relevant change
>> and cleaner commit msg and lets take it from there if my suggestion
>> here of the test is ok or something else or the test can be postponed
>> to a later time to think of a cleaner/right way to do it.
> 
> You missed the change in the python script as explained in the definition 
> header. I will reply to the patch.

oh right. i'll check it. thanks.

> 
> //Eelco
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to