On 4 Jan 2023, at 23:02, Ilya Maximets wrote:

> On 1/4/23 22:58, Ilya Maximets wrote:
>> On 1/4/23 10:43, Eelco Chaudron wrote:
>>>
>>>
>>> On 3 Jan 2023, at 21:03, Ilya Maximets wrote:
>>>
>>>> On 1/3/23 20:42, William Zhao wrote:
>>>>> Hi Ilya and Eelco,
>>>>>
>>>>> I have seen udpif->dump_duration in revalidate_ukey() go as high as ~2
>>>>> seconds. But the dump duration is very random and sporadic, most
>>>>> likely a function of how busy the CPU is.
>>>>>
>>>>> We have tried a max-revalidator time of 1 second and we saw
>>>>> connectivity losses presumably due to the flows being deleted in the
>>>>> "else" case for should_revalidate() call in the revalidate_ukey()
>>>>> function.
>>>>>
>>>>> Taking into consideration the Nvidia MLX driver for instance:
>>>>> https://ansymbol.com/linux/v5.14/source/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c#L40.
>>>>> We can see that they poll for stats every 1 second, but it could
>>>>> definitely be delayed. So the hardware stats update period is at least
>>>>> 1 second for MLX NICs.
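>>>>>
>>>>> For reference, the define at the linked line looks like this (quoting
>>>>> from memory, so double-check against the actual source):
>>>>>
>>>>>     /* Interval at which the driver's work item polls HW counters. */
>>>>>     #define MLX5_FC_STATS_PERIOD msecs_to_jiffies(1000)
>>>>>
>>>>> The stats work item re-queues itself with this period, so a counter
>>>>> read through OVS can lag the actual traffic by up to a second, or
>>>>> more if the work item is delayed.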
>>>>
>>>> What it tells me is that stats are not the problem here.
>>>> The problem is a very long dump duration.  If we're talking any values
>>>> above one second for a dump duration we're very likely talking about
>>>> hitting the dynamic datapath flow limit.  This will affect the max-idle
>>>> configuration or even trigger the 'kill_them_all' condition.
>>>>
>>>> How many datapath flows does that system have installed?  Does the
>>>> 'function of how busy the CPU is' mean that something else is running
>>>> on the same cores?
>>>>
>>>>
>>>> And if it takes 2 seconds to just dump flows, setting max-revalidator
>>>> to 4 seconds should not be a huge deal.
>>>>
>>>> A max-revalidator of 1 second is clearly not enough, as it should be
>>>> at least twice as high as the expected dump duration.
>>>
>>> Hi Ilya,
>>>
>>> Statistics are definitely the problem in this case, and I verified that in 
>>> William's setup. They might have seen +1 second dump times previously; 
>>> however, those were not responsible for the flow deletes. It was the flow 
>>> statistics not having been synced yet.
>>>
>>> The part that calculates the pps rate has a flaw: it assumes the retrieved 
>>> statistics are always accurate, which is not the case for HW offloaded 
>>> flows. This is what this patch is trying to fix. If we do not like the 
>>> static 2s delay, we could add a configurable delay, but we need a fix for 
>>> this specific problem.
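>>>
>>> To make the flaw concrete, the check boils down to something like this
>>> (a simplified sketch, not the exact upcall code):
>>>
>>>     /* Packets seen since the last poll, from the dumped datapath
>>>      * stats.  For a HW offloaded flow these stats can lag up to
>>>      * ~2 seconds behind reality, so 'packets' can read 0 for a
>>>      * flow that is actually busy. */
>>>     uint64_t packets = stats.n_packets - ukey->stats.n_packets;
>>>
>>>     /* ms per packet; with the default min-revalidate-pps of 5 the
>>>      * threshold is 1000 / 5 = 200 ms. */
>>>     long long int metric = packets ? duration / packets : LLONG_MAX;
>>>
>>>     if (metric < 1000 / ofproto_min_revalidate_pps) {
>>>         /* Busy enough: revalidate and keep the flow. */
>>>     } else {
>>>         /* Looks idle: deleted without revalidation, even though the
>>>          * HW may simply not have synced its counters yet. */
>>>     }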
>>>
>>> For now, a small spike of more than 1/2 max-revalidator could cause all HW 
>>> offloaded flows to be removed. And although we could change the 
>>> max-revalidator to 4 seconds, I do not see this as a final solution, as it 
>>> affects statistics accuracy, including that of non-offloaded flows.
>>>
>>> Note that in the field I've seen a lot of setups with occasional spikes of 
>>> more than 1/2 max-revalidator, and this should not be a problem. Having to 
>>> tweak OVS in all these cases just to keep HW offloaded flows from timing 
>>> out seems a bit too much.
>>>
>>> I guess William can debug why he has the occasional +1 second spikes (and 
>>> check whether those spikes are responsible by dumping the last duration 
>>> values), but that does not solve the root cause of the flow deletes, which 
>>> is the statistics not being used correctly.
>>>
>>> Cheers,
>>>
>>> Eelco
>>>
>>>>> On Tue, Jan 3, 2023 at 2:22 PM Ilya Maximets <[email protected]> wrote:
>>>>>>
>>>>>> On 1/3/23 13:58, Eelco Chaudron wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3 Jan 2023, at 13:36, Ilya Maximets wrote:
>>>>>>>
>>>>>>>> On 1/3/23 13:26, Eelco Chaudron wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3 Jan 2023, at 13:20, Ilya Maximets wrote:
>>>>>>>>>
>>>>>>>>>> On 1/3/23 12:33, Eelco Chaudron wrote:
>>>>>>>>>>> Depending on the driver implementation it can take up to 2 seconds
>>>>>>>>>>> before offloaded flow statistics are updated. This is causing a
>>>>>>>>>>> problem with min-revalidate-pps, as old statistic values are used
>>>>>>>>>>> during this period.
>>>>>>>>>>>
>>>>>>>>>>> This fix will wait for at least 2 seconds before assuming no packets
>>>>>>>>>>> were received during this period.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Eelco Chaudron <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> Changes:
>>>>>>>>>>>  - v2: Use existing ukey->offloaded so the revalidate_missed_dp_flow
>>>>>>>>>>>        case is also covered.
>>>>>>>>>>>
>>>>>>>>>>>  ofproto/ofproto-dpif-upcall.c |   25 +++++++++++++++----------
>>>>>>>>>>>  1 file changed, 15 insertions(+), 10 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>> index ad9635496..c395adbc6 100644
>>>>>>>>>>> --- a/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>> @@ -2094,10 +2094,11 @@ ukey_delete(struct umap *umap, struct udpif_key *ukey)
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>>  static bool
>>>>>>>>>>> -should_revalidate(const struct udpif *udpif, uint64_t packets,
>>>>>>>>>>> -                  long long int used)
>>>>>>>>>>> +should_revalidate(const struct udpif *udpif, struct udpif_key *ukey,
>>>>>>>>>>> +                  uint64_t packets)
>>>>>>>>>>>  {
>>>>>>>>>>>      long long int metric, now, duration;
>>>>>>>>>>> +    long long int used = ukey->stats.used;
>>>>>>>>>>>
>>>>>>>>>>>      if (!used) {
>>>>>>>>>>>          /* Always revalidate the first time a flow is dumped. */
>>>>>>>>>>> @@ -2124,8 +2125,12 @@ should_revalidate(const struct udpif *udpif, uint64_t packets,
>>>>>>>>>>>      duration = now - used;
>>>>>>>>>>>      metric = duration / packets;
>>>>>>>>>>>
>>>>>>>>>>> -    if (metric < 1000 / ofproto_min_revalidate_pps) {
>>>>>>>>>>> -        /* The flow is receiving more than min-revalidate-pps, so keep it. */
>>>>>>>>>>> +    if (metric < 1000 / ofproto_min_revalidate_pps ||
>>>>>>>>>>> +        (ukey->offloaded && duration < 2000)) {
>>>>>>>>>>> +        /* The flow is receiving more than min-revalidate-pps, so keep it.
>>>>>>>>>>> +         * Or it's a hardware offloaded flow that might take up to 2 seconds
>>>>>>>>>>> +         * to update its statistics. Until we are sure the statistics had a
>>>>>>>>>>> +         * chance to be updated, also keep it. */
>>>>>>>>>>
>>>>>>>>>> Hmm.  If you know that a flow dump is taking a long time on this 
>>>>>>>>>> setup,
>>>>>>>>>> shouldn't we just bump the max-revalidator value in the database 
>>>>>>>>>> instead
>>>>>>>>>> of hardcoding yet another magic constant?
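>>>>>>>>>>
>>>>>>>>>> I.e. something like (value in milliseconds):
>>>>>>>>>>
>>>>>>>>>>     ovs-vsctl set Open_vSwitch . other_config:max-revalidator=4000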
>>>>>>>>>
>>>>>>>>> I don’t think increasing the max-revalidator value is a good solution, 
>>>>>>>>> as it affects the statistics gathering in general.
>>>>>>>>
>>>>>>>> It affects the judgement on the dump duration and the
>>>>>>>> time when revalidators will wake up next time.  If your
>>>>>>>> dump duration is already high, your statistics gathering
>>>>>>>> is already delayed.  And your hardware doesn't seem to
>>>>>>>> provide statistics in a timely manner either.
>>>>>>>
>>>>>>> Well, this is true if the delay is constant, but this is from the last 
>>>>>>> run, and not all runs take the same amount of time :( Also, there is a 
>>>>>>> mix of hardware and kernel flows, where kernel flows provide real-time 
>>>>>>> statistics versus the up-to-2-second delay with HW offload.
>>>>>>>
>>>>>>> Take the following example; we just added some flows, which will 
>>>>>>> trigger multiple updates back to back. So let’s say it takes 251ms to 
>>>>>>> do the update (way under the ~1 second interval at which the hardware 
>>>>>>> counters are updated).
>>>>>>>
>>>>>>> The second iteration would purge all HW offloaded flows, as no counter 
>>>>>>> was updated in the 251ms window :(
>>>>>>>
>>>>>>> To solve this we could set the max-revalidator time to 4 seconds, but 
>>>>>>> this would negatively affect all statistics gathering for flows, and it 
>>>>>>> might also have other side effects in existing configurations.
>>
>> 4 seconds is only needed if the dump takes 2 seconds.  If it's 251ms,
>> 504ms will do.  But, anyway...
>
> Please, ignore the sentence above, posted a reply in the wrong
> part of the email. :/
>
>>
>> I'm not saying that we don't have an issue with statistics.  We do have
>> 2 separate issues here - dump duration and statistics update frequency.
>> Just trying to figure out which is a primary and which is a secondary
>> contributor to the flow deletion issues that you observe.
>>
>> My point is that if the dump duration wasn't that unreasonably high, we
>> would not see flow deletions.  OTOH, if the statistics updates weren't
>> delayed, we would not see the issue either.  So, I propose a draw. :)

Well, with the default config a 251ms run could hit the problem if the stats 
were not updated between the previous and the next (too long) run.
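
To put numbers on it (assuming the default max-revalidator of 500ms): a 251ms 
run fails the "dump_duration < max-revalidator / 2" fast path, since 
251 >= 250, so should_revalidate() falls through to the pps check. If the HW 
counters have not synced since the previous run, the packet count reads zero 
and the offloaded flow is deleted, even though it was forwarding traffic all 
along.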

>> We know the root cause of the statistics delay (driver), and we don't
>> know the root cause of the long dump duration.  So we need to figure
>> that out.

Yes, the excessive delay is something we need to figure out.

William, can you get some more accurate values? I think you have my reval patch 
applied, 
https://patchwork.ozlabs.org/project/openvswitch/patch/166367078734.325473.17899801287395410245.stgit@ebuild/,
so you could run the script with -q to get the values over time, and maybe 
also with -w for offline analysis.

>> Would still be great to know how many datapath flows we're dealing
>> with here.

The dump above will also give that info.

>> 'dump_duration' is a misleading name though, as it is the total revalidation
>> time including the sweep and all other phases.  So, the '/ 2' part of the
>> check in should_revalidate() doesn't seem to make a lot of sense.  Removing
>> it might help with the situation a bit.
>> If the dump durations are too inconsistent, we could replace that metric
>> with an exponential moving average of the last N runs (lib/mov-avg.h).

Nice, I did not know we had this.
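
As a strawman, a plain EMA over the measured durations could look like this 
(just a sketch with made-up names; the actual lib/mov-avg.h API may differ):

    /* Exponential moving average of recent dump durations, so one
     * slow run does not flip the should_revalidate() decision. */
    #define DUMP_EMA_ALPHA 0.125        /* Weight of the newest sample. */

    static double dump_duration_ema;    /* Smoothed duration in ms. */

    static void
    dump_duration_ema_update(long long int duration_ms)
    {
        if (!dump_duration_ema) {
            dump_duration_ema = duration_ms;  /* Seed with first sample. */
        } else {
            dump_duration_ema += DUMP_EMA_ALPHA
                                 * (duration_ms - dump_duration_ema);
        }
    }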

>> Also we could go the other way and just allow users to disable that
>> "optimization" by setting min-revalidate-pps to zero (disabled).
>> This is what Han proposed here:
>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>> I'm guessing that it's very similar in its roots or even the exact
>> same issue that you both are trying to solve.
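>>
>> With that in place, disabling the optimization would just be:
>>
>>     ovs-vsctl set Open_vSwitch . other_config:min-revalidate-pps=0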
>>
>> What do you think?

Yes, I thought of doing the same and found Han’s patch. I had two conflicts 
(in my head ;)):

- It requires a configuration parameter.
- And what if you really want to flush based on load?

Once I have my RTE flow setup, I will revisit this patch to see if it also 
applies to rte offloads.
It looks to me that rte_flow counters are read on the fly in the driver.

Maor, can you maybe confirm?


>>>>>>
>>>>>> If the dump takes 251ms, you only need to bump max-revalidator to 504ms,
>>>>>> not 4 seconds.  We should not mix up the time it takes to update the
>>>>>> statistics and the dump duration.
>>>>>>
>>>>>> Flow being revalidated doesn't mean it will be deleted.  The flow will
>>>>>> be deleted during revalidation only if it is actually incorrect.
>>>>>>
>>>>>> If it is idle for 10 seconds, it will be deleted without even considering
>>>>>> revalidation.  This is controlled by the max-idle config.
>>>>>>
>>>>>> In your case the flows didn't reach max-idle yet, but are removed *without*
>>>>>> revalidation.  This happens because OVS thinks that it's too expensive
>>>>>> to trigger revalidation based on the flow dump duration.
>>>>>> This is controlled by the max-revalidator config.
>>>>>>
>>>>>> If your actual dump takes 251ms, but max-revalidator is 504ms, the
>>>>>> flow should not be removed even if it has zero stats for the next few
>>>>>> seconds.  As long as the flow itself is correct.
>>>>>>
>>>>>> That's why we need to know what is the actual dump duration here.
>>>>>>
>>>>>>>
>>>>>>>>> It also does not solve the problem, as any update in the flow 
>>>>>>>>> table/config can
>>>>>>>>> trigger a revalidation.
>>>>>>>>
>>>>>>>> It will, but there is a check right above the one that
>>>>>>>> you're changing and it looks like this:
>>>>>>>>
>>>>>>>>     if (udpif->dump_duration < ofproto_max_revalidator / 2) {
>>>>>>>>         /* We are likely to handle full revalidation for the flows. */
>>>>>>>>         return true;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> So, by bumping the max-revalidator value we will avoid
>>>>>>>> blind deletion and will properly revalidate the flow.
>>>>>>>> The flow will not be deleted then, unless we hit the max-idle.
>>>>>>>
>>>>>>> See above.
>>>>>>>
>>>>>>>> Or am I missing something?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If we do not like the magic static values we can add a configuration 
>>>>>>>>> option, so it can be adjusted based on the driver implementation.
>>>>>>>>>
>>>>>>>>>> How long does the dump actually take?  And how many flows are in it?
>>>>>>>>>
>>>>>>>>> William, do you have these values? I did not capture any of this 
>>>>>>>>> when debugging your setup.
>>>>>>>>>
>>>>>>>>> //Eelco
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
