On 4 Jan 2023, at 23:02, Ilya Maximets wrote:
> On 1/4/23 22:58, Ilya Maximets wrote:
>> On 1/4/23 10:43, Eelco Chaudron wrote:
>>>
>>> On 3 Jan 2023, at 21:03, Ilya Maximets wrote:
>>>
>>>> On 1/3/23 20:42, William Zhao wrote:
>>>>> Hi Ilya and Eelco,
>>>>>
>>>>> I have seen udpif->dump_duration in revalidate_ukey() go as high as
>>>>> ~2 seconds. But the dump duration is very random and sporadic, most
>>>>> likely a function of how busy the CPU is.
>>>>>
>>>>> We have tried a max-revalidator time of 1 second and we saw
>>>>> connectivity losses, presumably due to the flows being deleted in
>>>>> the "else" case of the should_revalidate() call in the
>>>>> revalidate_ukey() function.
>>>>>
>>>>> Taking the Nvidia MLX driver as an example:
>>>>> https://ansymbol.com/linux/v5.14/source/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c#L40
>>>>> We can see that it polls for stats every 1 second, but this can
>>>>> definitely be delayed. So the hardware stats update period is at
>>>>> least 1 second for MLX NICs.
>>>>
>>>> What this tells me is that stats are not the problem here.
>>>> The problem is a very long dump duration. If we're talking about any
>>>> values above one second for a dump duration, we're very likely
>>>> talking about hitting the dynamic datapath flow limit. This will
>>>> affect the max-idle configuration or even trigger the
>>>> 'kill_them_all' condition.
>>>>
>>>> How many datapath flows does that system have installed? Does 'a
>>>> function of how busy the CPU is' mean that something else is running
>>>> on the same cores?
>>>>
>>>> And if it takes 2 seconds just to dump flows, setting
>>>> max-revalidator to 4 seconds should not be a huge deal.
>>>>
>>>> A max-revalidator of 1 second is clearly not enough, as it should be
>>>> at least twice as high as the expected dump duration.
>>>
>>> Hi Ilya,
>>>
>>> Statistics are definitely the problem in this case, and I verified
>>> that in William's setup. They might have seen +1 second dump times
>>> previously; however, those were not responsible for the flow deletes.
>>> It was the flow statistics not being synced yet.
>>>
>>> The part that calculates the pps rate has a flaw, as it assumes the
>>> retrieved statistics are always accurate, which is not the case for
>>> hw offloaded flows. This is what this patch is trying to fix. So if
>>> we do not like the static 2s delay, we could add a configurable
>>> delay, but we need a fix for this specific problem.
>>>
>>> For now, a small spike of more than 1/2 max-revalidator could cause
>>> all HW offloaded flows to be removed. And although we could change
>>> max-revalidator to 4 seconds, I do not see this as a final solution,
>>> as it affects statistics accuracy, including that of non-offloaded
>>> flows.
>>>
>>> Note that in the field I've seen a lot of setups that have occasional
>>> spikes of more than 1/2 max-revalidator, and this should not be a
>>> problem. Having to tweak OVS in all these cases just to keep HW
>>> offloaded flows from timing out seems a bit too much.
>>>
>>> I guess William can debug why he has the occasional +1s spikes (and
>>> whether those spikes are responsible, by dumping the last duration
>>> values), but that does not solve the root cause of the flow deletes,
>>> which is the statistics not being used correctly.
>>>
>>> Cheers,
>>>
>>> Eelco
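To make that flaw concrete, below is a rough sketch of the
min-revalidate-pps arithmetic (illustrative values only, not the exact
source; it assumes the default min-revalidate-pps of 5):

    /* Values as they might look when the driver has not pushed new
     * counters yet, so 'used' and 'packets' are stale. */
    long long int now = 10000;                  /* time_msec(). */
    long long int used = 9000;                  /* Last-used stamp. */
    uint64_t packets = 3;                       /* Packets since last dump. */

    long long int duration = now - used;        /* 1000 ms. */
    long long int metric = duration / packets;  /* 333 ms per packet. */

    if (metric < 1000 / 5) {                    /* Threshold: 200 ms. */
        /* Flow receives more than min-revalidate-pps, keep it. */
    } else {
        /* Flow looks too slow and may be deleted, even though the
         * hardware is still forwarding packets and only the counter
         * readout is lagging. */
    }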
>>>
>>>>> On Tue, Jan 3, 2023 at 2:22 PM Ilya Maximets <[email protected]> wrote:
>>>>>>
>>>>>> On 1/3/23 13:58, Eelco Chaudron wrote:
>>>>>>>
>>>>>>> On 3 Jan 2023, at 13:36, Ilya Maximets wrote:
>>>>>>>
>>>>>>>> On 1/3/23 13:26, Eelco Chaudron wrote:
>>>>>>>>>
>>>>>>>>> On 3 Jan 2023, at 13:20, Ilya Maximets wrote:
>>>>>>>>>
>>>>>>>>>> On 1/3/23 12:33, Eelco Chaudron wrote:
>>>>>>>>>>> Depending on the driver implementation, it can take up to 2
>>>>>>>>>>> seconds before offloaded flow statistics are updated. This is
>>>>>>>>>>> causing a problem with min-revalidate-pps, as old statistic
>>>>>>>>>>> values are used during this period.
>>>>>>>>>>>
>>>>>>>>>>> This fix will wait for at least 2 seconds before assuming no
>>>>>>>>>>> packets were received during this period.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Eelco Chaudron <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> Changes:
>>>>>>>>>>>  - v2: Use existing ukey->offloaded so the
>>>>>>>>>>>        revalidate_missed_dp_flow case is also covered.
>>>>>>>>>>>
>>>>>>>>>>>  ofproto/ofproto-dpif-upcall.c | 25 +++++++++++++++----------
>>>>>>>>>>>  1 file changed, 15 insertions(+), 10 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>> index ad9635496..c395adbc6 100644
>>>>>>>>>>> --- a/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>> @@ -2094,10 +2094,11 @@ ukey_delete(struct umap *umap, struct udpif_key *ukey)
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>>  static bool
>>>>>>>>>>> -should_revalidate(const struct udpif *udpif, uint64_t packets,
>>>>>>>>>>> -                  long long int used)
>>>>>>>>>>> +should_revalidate(const struct udpif *udpif, struct udpif_key *ukey,
>>>>>>>>>>> +                  uint64_t packets)
>>>>>>>>>>>  {
>>>>>>>>>>>      long long int metric, now, duration;
>>>>>>>>>>> +    long long int used = ukey->stats.used;
>>>>>>>>>>>
>>>>>>>>>>>      if (!used) {
>>>>>>>>>>>          /* Always revalidate the first time a flow is dumped. */
>>>>>>>>>>> @@ -2124,8 +2125,12 @@ should_revalidate(const struct udpif *udpif, uint64_t packets,
>>>>>>>>>>>      duration = now - used;
>>>>>>>>>>>      metric = duration / packets;
>>>>>>>>>>>
>>>>>>>>>>> -    if (metric < 1000 / ofproto_min_revalidate_pps) {
>>>>>>>>>>> -        /* The flow is receiving more than min-revalidate-pps, so keep it. */
>>>>>>>>>>> +    if (metric < 1000 / ofproto_min_revalidate_pps ||
>>>>>>>>>>> +        (ukey->offloaded && duration < 2000)) {
>>>>>>>>>>> +        /* The flow is receiving more than min-revalidate-pps, so keep it.
>>>>>>>>>>> +         * Or it's a hardware offloaded flow that might take up to 2 seconds
>>>>>>>>>>> +         * to update its statistics. Until we are sure the statistics had a
>>>>>>>>>>> +         * chance to be updated, also keep it. */
>>>>>>>>>>
>>>>>>>>>> Hmm. If you know that a flow dump is taking a long time on this
>>>>>>>>>> setup, shouldn't we just bump the max-revalidator value in the
>>>>>>>>>> database instead of hardcoding yet another magic constant?
>>>>>>>>>
>>>>>>>>> I don't think increasing the max-revalidator value is a good
>>>>>>>>> solution, as it affects the statistics gathering in general.
>>>>>>>>
>>>>>>>> It affects the judgement on the dump duration and the time when
>>>>>>>> revalidators will wake up next time. If your dump duration is
>>>>>>>> already high, your statistics gathering is already delayed. And
>>>>>>>> your hardware doesn't seem to provide statistics in a timely
>>>>>>>> manner either.
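To put some hypothetical numbers on the failure mode (all values are
made up for illustration, assuming the default max-revalidator of
500 ms):

    t=0ms    Offloaded flow installed; hardware starts forwarding.
    t=251ms  A revalidator run (the flow table just changed) takes
             251 ms, which is more than half of max-revalidator, so
             flows are no longer fully revalidated but judged on their
             statistics alone.
    t=251ms  The driver (e.g. mlx5 with its 1-second polling) has not
             pushed counters yet, so the flow's stats look untouched,
             should_revalidate() returns false, and the flow is deleted
             while it is actively forwarding traffic.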
>>>>>>>
>>>>>>> Well, this is true if the delay is constant, but this is from the
>>>>>>> last run, and not all runs take the same amount of time :( Also,
>>>>>>> there is a mix of hardware and kernel flows, where kernel flows do
>>>>>>> provide real-time statistics, versus up to 2 seconds of delay with
>>>>>>> HW offload.
>>>>>>>
>>>>>>> Take the following example; we just added some flows, which will
>>>>>>> trigger multiple updates back to back. So let's say it takes 251ms
>>>>>>> to do the update (way under the 1.x seconds at which the hardware
>>>>>>> counter is updated).
>>>>>>>
>>>>>>> The second iteration would purge all HW offloaded flows, as no
>>>>>>> counter was updated in the 251ms window :(
>>>>>>>
>>>>>>> To solve this we could set the max-revalidator time to 4 seconds,
>>>>>>> but this will negatively affect all statistics gathering for
>>>>>>> flows. And maybe this might also have some other side effects in
>>>>>>> existing configurations.
>>
>> 4 seconds is only needed if the dump takes 2 seconds. If it's 251ms,
>> 504ms will do. But, anyway...
>
> Please ignore the sentence above; I posted a reply in the wrong part
> of the email. :/
>
>> I'm not saying that we don't have an issue with statistics. We do
>> have 2 separate issues here - dump duration and statistics update
>> frequency. Just trying to figure out which is a primary and which is
>> a secondary contributor to the flow deletion issues that you observe.
>>
>> My point is that if the dump duration wasn't that unreasonably high,
>> we would not see flow deletions. OTOH, if statistics updates weren't
>> delayed, we would not see the issue either. So, I propose a draw. :)

Well, even with the default configuration, a run of 251ms could hit the
problem if the stats were not updated between the previous (too long)
run and the next one.

>> We know the root cause of the statistics delay (driver), and we don't
>> know the root cause of the long dump duration. So we need to figure
>> that out.

Yes, the excessive delay is what we need to figure out. William, can
you get some more accurate values? I think you have my reval patch
applied,
https://patchwork.ozlabs.org/project/openvswitch/patch/166367078734.325473.17899801287395410245.stgit@ebuild/,
so you could run the script with -q to get the values over time (maybe
also with -w for offline analysis).

>> Would still be great to know how many datapath flows we're dealing
>> with here.

The dump above will also give that info.

>> 'dump_duration' is a misleading name though, as it is the total
>> revalidation time, including sweep and all other phases. So, the
>> '/ 2' part of the check in should_revalidate() doesn't seem to make a
>> lot of sense. Removing it might help with the situation a bit.
>> If the dump durations are too inconsistent, we could replace that
>> metric with an exponential moving average of the last N runs
>> (lib/mov-avg.h).

Nice, did not know we had this.
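Something like the sketch below is roughly what I would picture for
smoothing the dump duration (names are hypothetical; I have not checked
the exact lib/mov-avg.h API):

    /* Exponential moving average of the revalidator run time, so a
     * single slow run does not flip the should_revalidate() decision.
     * All names here are hypothetical. */
    #define DUMP_EMA_ALPHA 0.25          /* Weight of the newest sample. */

    static double dump_duration_ema;

    static void
    dump_duration_ema_update(long long int last_dump_duration_ms)
    {
        dump_duration_ema = DUMP_EMA_ALPHA * last_dump_duration_ms
                            + (1 - DUMP_EMA_ALPHA) * dump_duration_ema;
    }

    /* should_revalidate() would then test the smoothed value instead
     * of the raw udpif->dump_duration:
     *
     *     if (dump_duration_ema < ofproto_max_revalidator / 2) {
     *         return true;
     *     }
     */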
>> Also, we could go the other way and just allow users to disable that
>> "optimization" by setting min-revalidate-pps to zero (disabled).
>> This is what Han proposed here:
>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>> I'm guessing that it's very similar in its roots, or even the exact
>> same issue, that you both are trying to solve.
>>
>> What do you think?

Yes, I thought of doing the same and found Han's patch. I had two
conflicts (in my head ;):

- It requires a configuration parameter.
- And what if you really want to flush based on load?

Once I have my RTE flow setup, I will revisit this patch to see if it
also applies to rte offloads. It looks to me like rte_flow counters are
read on the fly in the driver. Maor, can you maybe confirm?

>>>>>>
>>>>>> If the dump takes 251ms, you only need to bump max-revalidator to
>>>>>> 504ms, not 4 seconds. We should not mix up the time it takes to
>>>>>> update the statistics and the dump duration.
>>>>>>
>>>>>> A flow being revalidated doesn't mean it will be deleted. The flow
>>>>>> will be deleted during revalidation only if it is actually
>>>>>> incorrect.
>>>>>>
>>>>>> If it is idle for 10 seconds, it will be deleted without even
>>>>>> considering revalidation. This is controlled by the max-idle
>>>>>> config.
>>>>>>
>>>>>> In your case the flows didn't reach max-idle yet, but are removed
>>>>>> *without* revalidation. This happens because OVS thinks that it's
>>>>>> too expensive to trigger revalidation, based on the flow dump
>>>>>> duration. This is controlled by the max-revalidator config.
>>>>>>
>>>>>> If your actual dump takes 251ms, but max-revalidator is 504ms, the
>>>>>> flow should not be removed even if it has zero stats for the next
>>>>>> few seconds. As long as the flow itself is correct.
>>>>>>
>>>>>> That's why we need to know what the actual dump duration is here.
>>>>>>
>>>>>>>
>>>>>>>>> It also does not solve the problem, as any update in the flow
>>>>>>>>> table/config can trigger a revalidation.
>>>>>>>>
>>>>>>>> It will, but there is a check right above the one that you're
>>>>>>>> changing, and it looks like this:
>>>>>>>>
>>>>>>>>     if (udpif->dump_duration < ofproto_max_revalidator / 2) {
>>>>>>>>         /* We are likely to handle full revalidation for the flows. */
>>>>>>>>         return true;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> So, by bumping the max-revalidator value we will avoid blind
>>>>>>>> deletion and will properly revalidate the flow. The flow will
>>>>>>>> not be deleted then, unless we hit max-idle.
>>>>>>>
>>>>>>> See above.
>>>>>>>
>>>>>>>> Or am I missing something?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If we do not like the magic static values, we can add a
>>>>>>>>> configuration option, so it can be adjusted based on the driver
>>>>>>>>> implementation.
>>>>>>>>>
>>>>>>>>>> How long does the dump actually take? And how many flows are
>>>>>>>>>> in it?
>>>>>>>>>
>>>>>>>>> William, do you have these values? I did not capture any of
>>>>>>>>> this when debugging your setup.
>>>>>>>>>
>>>>>>>>> //Eelco
