On 3 Jul 2024, at 12:22, Roi Dayan wrote:

> On 03/07/2024 13:01, Eelco Chaudron wrote:
>>
>>
>> On 3 Jul 2024, at 9:31, Roi Dayan wrote:
>>
>>> On 18/06/2024 11:53, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 18 Jun 2024, at 8:05, Roi Dayan wrote:
>>>>
>>>>> On 03/06/2024 16:29, Eelco Chaudron wrote:
>>>>>>
>>>>>>
>>>>>> On 3 Jun 2024, at 10:07, Roi Dayan wrote:
>>>>>>
>>>>>>> On 03/06/2024 10:18, Roi Dayan wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 30/05/2024 18:48, Eelco Chaudron wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote:
>>>>>>>>>
>>>>>>>>>> It is observed in some environments that there are much more ukeys 
>>>>>>>>>> than
>>>>>>>>>> actual DP flows. For example:
>>>>>>>>>>
>>>>>>>>>> $ ovs-appctl upcall/show
>>>>>>>>>> system@ovs-system:
>>>>>>>>>> flows : (current 7) (avg 6) (max 117) (limit 2125)
>>>>>>>>>> offloaded flows : 525
>>>>>>>>>> dump duration : 1063ms
>>>>>>>>>> ufid enabled : true
>>>>>>>>>>
>>>>>>>>>> 23: (keys 3612)
>>>>>>>>>> 24: (keys 3625)
>>>>>>>>>> 25: (keys 3485)
>>>>>>>>>>
>>>>>>>>>> The revalidator threads are busy revalidating the stale ukeys 
>>>>>>>>>> leading to
>>>>>>>>>> high CPU and long dump duration.
>>>>>>>>>>
>>>>>>>>>> There are some possible situations that may result in stale ukeys 
>>>>>>>>>> that
>>>>>>>>>> have no corresponding DP flows.
>>>>>>>>>>
>>>>>>>>>> In revalidator, push_dp_ops() doesn't check error if the op type is 
>>>>>>>>>> not
>>>>>>>>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc 
>>>>>>>>>> offload
>>>>>>>>>> case, where the old flow is deleted first and then the new one is
>>>>>>>>>> created. If the creation fails, the ukey will be stale (no 
>>>>>>>>>> corresponding
>>>>>>>>>> DP flow). This patch adds a warning in such case.
>>>>>>>>>
>>>>>>>>> Not sure I understand, this behavior should be captured by the 
>>>>>>>>> UKEY_INCONSISTENT state.
>>>>>>>>
>>>>>>>> Hi Eelco,
>>>>>>>>
>>>>>>>> thanks for reviewing.
>>>>>>>> We started the patch on older branch as we didn't rebase for some time
>>>>>>>> and got a little later on sending it.
>>>>>>>> I see the addition now for UKEY_INCOSISTENT in the following patch:
>>>>>>>>
>>>>>>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors.
>>>>>>>>
>>>>>>>> Looks like it handles the same situation we tried to handle in this 
>>>>>>>> patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Another possible scenario is in handle_upcalls() if a PUT operation 
>>>>>>>>>> did
>>>>>>>>>> not succeed and op->error attribute was not set correctly it can 
>>>>>>>>>> lead to
>>>>>>>>>> stale ukey in operational state.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Guess we need to figure out these cases, as these are the root cause 
>>>>>>>>> of your problem.
>>>>>>>>
>>>>>>>> right. maybe. but this could help keep the system alive/clean while 
>>>>>>>> logging the
>>>>>>>> bad situation instead of having increase in those stale/inconsistent 
>>>>>>>> ukeys.
>>>>>>>> I understand if it's not nice to handle it like this.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Eelco,
>>>>>>>
>>>>>>> I remember now one of the reproduction scenarios we did is do some 
>>>>>>> traffic
>>>>>>> to get rules added using TC and then delete those from tc command line
>>>>>>> and it got to stale ukeys.
>>>>>>> The revalidator dump thread not seeing the rules so not updating 
>>>>>>> anything
>>>>>>> and sweep over the ukeys skipping them as well.
>>>>>>> This is why we checked against the timing stats of the ukey.
>>>>>>>
>>>>>>> I remember I tested on the upstream code and not our development branch
>>>>>>> and it reproduces. I didn't notice if the commit adding 
>>>>>>> UKEY_INCONSISTENT
>>>>>>> existed but it handles errors from adding flows so I dont think it 
>>>>>>> matters.
>>>>>>>
>>>>>>> I'll try to check and verify again but I think it's still here.
>>>>>>> So while cases getting dop.error handled with UKEY_INCONSISTENT,
>>>>>>> this case I think is not.
>>>>>>
>>>>>> I think you are right this case is not covered by the UKEY_INCONSISTENT 
>>>>>> test below. See my suggestion on fixing this in revalidate_ukey().
>>>>>>
>>>>>> Maybe you could also add a test case for this in the offload test suite.
>>>>>>
>>>>>> //Eelco
>>>>>
>>>>> Hi Eelco,
>>>>>
>>>>> Thanks for the review. I didn't have time to check this but wanted to
>>>>> reply that it's in my queue to verify your suggestion and add a test.
>>>>
>>>> Thanks for letting me know, and I understand as we are all busy :)
>>>>
>>>> //Eelco
>>>>
>>>
>>>
>>> Hi Eelco,
>>>
>>> I tested your suggestion to move to against removing tc rules
>>> from tc command I see it's helpful in cleaning those stale ukeys.
>>> I have trouble doing a clean test for this for the testsuite.
>>
>> With the tc rule deletion, it becomes an offload-specific test case, but as 
>> this is the only way we’ve seen the issue, it might be fine to add it just 
>> there.
>>
>
> yes. after the first half of the fix you pointed out was implemented already.
> now the way i reproduce it is deleting tc rules.
> could potentially happen from something else or because of another bug 
> somewhere.
>
>>> At first I tested with modification in revalidator_sweep__() to
>>> always set seq_mismatch to reach revalidate_ukey().
>>> Later I moved to create a redundant veth and doing add/del loop
>>> of it to the bridge to cause a seq mismatch. - sure this is
>>> the clean way but without the change I get stale ukeys and with
>>> it we get to the cleanup change and cleaning the ukeys.
>>
>> Can you maybe share the test case so I know what you are doing? Is this 
>> working for the general system tests?
>
> I did this:
>
> - create bridge with 2 veth ports. configure ips.
> - ping between the ports to have tc rules.
> - repeated few times: clear the tc rules like this: tc filter del dev veth1 
> ingress and also on the 2nd port.
> - set max-idle to 1 and remove it to cause a flush of the rules.
> - create another set of veth ports. add/del veth4 from the bridge. to cause a 
> sweep.
> - before the fix: ovs-appctl upcall/show will show ukeys.
> - after the fix upcall/show will show 0 ukeys.

Thanks, will find some time to test this, and reply to the V2.

//Eelco

>>> I'll send v2 for now as a clean patch with only the relevant change
>>> and cleaner commit msg and lets take it from there if my suggestion
>>> here of the test is ok or something else or the test can be postponed
>>> to a later time to think of a cleaner/right way to do it.
>>
>> You missed the change in the python script as explained in the definition 
>> header. I will reply to the patch.
>
> oh right. i'll check it. thanks.
>
>>
>> //Eelco
>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to