On 2/5/24 12:10, Ilya Maximets wrote:
> On 2/5/24 11:58, Dumitru Ceara wrote:
>> On 2/5/24 11:34, Ilya Maximets wrote:
>>> On 2/5/24 09:23, Dumitru Ceara wrote:
>>>> On 2/5/24 08:13, Han Zhou wrote:
>>>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique <[email protected]> wrote:
>>>>>>
>>>>>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou <[email protected]> wrote:
>>>>>>>
>>>>>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> 35 files changed, 9681 insertions(+), 4645 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> I had another look at this series and acked the remaining patches. I just had some minor comments that can be easily fixed when applying the patches to the main branch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for all the work on this! It was a very large change, but it improves northd performance significantly. I just hope we don't introduce too many bugs. Hopefully the time we have until release will allow us to further test this change on the 24.03 branch.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dumitru
>>>>>>>>>>
>>>>>>>>>> Thanks a lot Dumitru and Han for the reviews and patience.
>>>>>>>>>>
>>>>>>>>>> I addressed the comments and applied the patches to main and also to branch-24.03.
>>>>>>>>>>
>>>>>>>>>> @Han - I know you wanted to take another look into v6. I didn't want to delay further as branch-24.03 was created. I'm more than happy to submit follow-up patches if you have any comments to address. Please let me know.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Numan,
>>>>>>>>>
>>>>>>>>> I was writing the reply and saw your email just now. Thanks a lot for taking on this huge effort to achieve such a great optimization. I only left one comment, on the implicit dependency left for en_lrnat -> en_lflow. Feel free to address it with a follow-up; no need to block the branching. And take my Ack for the series with that addressed.
>>>>>>>>>
>>>>>>>>> Acked-by: Han Zhou <hzhou at ovn.org>
>>>>>>>>
>>>>>>>> Hi, Numan, Dumitru and Han.
>>>>>>>>
>>>>>>>> I see a huge negative performance impact, most likely from this set, on ovn-heater's cluster-density tests. The memory consumption on northd
>>>>
>>>> Thanks for reporting this, Ilya!
>>>>
>>>>>>>> jumped about 4x and it constantly recomputes due to failures of the port_group handler:
>>>>>>>>
>>>>>>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
>>>>>>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
>>>>>>>> ...
>>>>>>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
>>>>>>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
>>>>>>>> ...
>>>>>>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
>>>>>>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
>>>>>>>> ...
>>>>>>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
>>>>>>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)
>>>>>>>>
>>>>>>>> That increases the 95th percentile ovn-installed latency in the 500-node cluster-density test from 3.6 seconds last week to 21.5 seconds this week.
>>>>>>>>
>>>>>>>> I think this should be a release blocker.
>>>>>>>>
>>>>>>>> Memory usage is also very concerning. Unfortunately, it is not tied to the cluster-density test; the same 4-5x RSS jump is also seen in other tests like density-heavy. Last week the RSS of ovn-northd in 500-node cluster-density was between 1.5 and 2.5 GB; this week we have a range between 5.5 and 8.5 GB.
>>>>>>>>
>>>>>>>> I would consider this a release blocker as well.
>>>>>>>>
>>>>
>>>> I agree, we shouldn't release 24.03.0 unless these two issues are (sufficiently) addressed. We do have until March 1st (the official release date) to do that or to revert any patches that cause regressions.
>>>>
>>>>>>>>
>>>>>>>> I don't have direct evidence that this particular series is the culprit, but it looks like the most likely candidate. I can dig more into the investigation on Monday.
>>>>>>>>
>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>
>>>>>>> Thanks Ilya for reporting this. The 95th percentile latency and 4x RSS increases are a little surprising to me. I did test this series with my scale test scripts for recompute performance regression; it showed a 10+% increase in latency. I even dug a little into it and noticed a ~5% increase caused by the hmap used to maintain the lflows in each lflow_ref. This was discussed in the code review for an earlier version (v2/v3). Overall it did not look very bad, if we now handle the most common scenarios incrementally, and it is reasonable to have some cost for maintaining the references/index for incremental processing. I wondered whether my test scenario was too simple (it didn't include LBs) to find the problems, so today I did another test including an LB group with 1k LBs applied to 100 node-LS & GRs, and another 1K LBs per node-LS & GR (101K LBs in total), and I did see a bigger performance penalty, but still within ~20%. For memory I didn't notice a significant increase (<10%). I believe I am missing some specific scenario that had the big impact in the ovn-heater tests. Please share if you dig out more clues.
>>>>>>
>>>>>> Hi Ilya,
>>>>>>
>>>>>> Thanks for reporting these details.
>>>>>>
>>>>>> I had a look at this regression. There is a significant increase in the lflow recompute time (around 4x) in my local testing, and this is definitely not acceptable.
>>>>>>
>>>>>> In this particular cluster-density test, whenever a port group is created it results in a full recompute, and since the recompute time has increased, it has a cumulative effect on the latency of the test.
>>>>>>
>>>>>> The dp reference counting [1] added in v4 of this series has introduced this regression (both CPU and memory). I'm working on a fix and I think I should be able to address this soon.
>>>>>>
>>>>>> [1] - https://github.com/ovn-org/ovn/blob/main/northd/lflow-mgr.c#L696
>>>>>>
>>>>>
>>>>> Indeed, this is the same concern I mentioned when reviewing v5:
>>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/411194.html
>>>
>>> Yeah. In this 500-node cluster-density test, dp_refcnt_use() allocates 120 M references and puts them into hash maps. That alone takes several GB of memory, and it is the largest consumer according to massif: it accounts for ~45% of total memory allocations. Another ~25% is taken by the hash maps where the refcounts are stored.
>>>
>>> There is a simple optimization that can be made - don't allocate a refcount the first time we try to add one; the original bitmap should be enough. Only if we try to use it again while it is already set in the bitmap do we allocate. That saves a lot of space, because 80% of all refcounts in that test are never used more than once. However, it might not be enough; it needs testing. Ideally, the whole structure needs to go or be replaced by something different. Frequent iterations over the bitmaps and per-datapath*per-lflow memory allocations do not scale. Even if we can make it work reasonably well for 500 nodes, it will cause problems at higher scale.
>>>
>>
>> Isn't this in a way related to the question "does it make sense to fall back to a form of recompute at a certain scale"?
>
> I'm not sure that question is relevant here. We do want I-P to work at high scale. So far the recompute is slow, so we want to avoid it, and we are also making it slower by adding more I-P-related code.
>
> And this only applies to CPU usage. The memory usage is more or less the same regardless of whether the computations are done incrementally or not.
>
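To make the lazy-allocation idea above concrete, here is a minimal sketch. All names (struct dp_use, lflow_dp_ref, lflow_dp_unref, etc.) are hypothetical and this is not the actual lflow-mgr.c code; it only illustrates the approach under those assumptions: the existing per-lflow datapath bitmap stands in for the first reference, and an explicit counter is allocated only when the same datapath references the lflow a second time.

#include <stdbool.h>
#include <stdlib.h>

struct dp_use {                  /* Explicit counter, allocated lazily. */
    size_t dp_index;             /* Datapath this counter belongs to. */
    unsigned int extra_refs;     /* References beyond the bitmap bit, >= 1. */
    struct dp_use *next;         /* Plain list for brevity; the real code
                                  * keeps such counters in an hmap. */
};

struct lflow_dp_refs {
    unsigned long *dp_bitmap;    /* Bit set => at least one reference. */
    struct dp_use *extra;        /* Counters for dps referenced more than once. */
};

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static bool
bit_is_set(const unsigned long *bm, size_t i)
{
    return (bm[i / BITS_PER_LONG] >> (i % BITS_PER_LONG)) & 1;
}

static void
bit_set(unsigned long *bm, size_t i)
{
    bm[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
}

static void
bit_clear(unsigned long *bm, size_t i)
{
    bm[i / BITS_PER_LONG] &= ~(1UL << (i % BITS_PER_LONG));
}

static struct dp_use *
dp_use_find(struct lflow_dp_refs *refs, size_t dp_index)
{
    for (struct dp_use *u = refs->extra; u; u = u->next) {
        if (u->dp_index == dp_index) {
            return u;
        }
    }
    return NULL;
}

/* Takes a reference on (lflow, datapath).  The common case (first and only
 * reference) touches nothing but the bitmap and allocates no memory. */
static void
lflow_dp_ref(struct lflow_dp_refs *refs, size_t dp_index)
{
    if (!bit_is_set(refs->dp_bitmap, dp_index)) {
        bit_set(refs->dp_bitmap, dp_index);
        return;
    }

    struct dp_use *u = dp_use_find(refs, dp_index);
    if (!u) {
        /* Second reference from this datapath: only now allocate. */
        u = malloc(sizeof *u);   /* OVS code would use xmalloc() here. */
        u->dp_index = dp_index;
        u->extra_refs = 1;       /* Plus the implicit bitmap reference. */
        u->next = refs->extra;
        refs->extra = u;
    } else {
        u->extra_refs++;
    }
}

/* Drops a reference; returns true if the datapath no longer uses the lflow
 * and the caller may remove the lflow from that datapath. */
static bool
lflow_dp_unref(struct lflow_dp_refs *refs, size_t dp_index)
{
    struct dp_use *u = dp_use_find(refs, dp_index);
    if (u) {
        if (--u->extra_refs) {
            return false;        /* Still referenced more than once. */
        }
        /* Explicit counter exhausted; fall back to the bitmap bit alone. */
        struct dp_use **p = &refs->extra;
        while (*p != u) {
            p = &(*p)->next;
        }
        *p = u->next;
        free(u);
        return false;
    }
    bit_clear(refs->dp_bitmap, dp_index);
    return true;                 /* Last reference is gone. */
}

The real code keeps the explicit counters in hash maps (the ~25% of allocations mentioned above), but the shape of the saving is the same: if ~80% of (lflow, datapath) references only ever happen once, the common path allocates nothing at all.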
On a slightly related note, these test scenarios are very ovn-kubernetes specific, and ovn-kubernetes moved to using a flavor of ovn-ic to address (other) scalability issues in deployments with lots of k8s services (lots of load balancers applied on an LB group attached to each node switch/router). With ovn-ic they'll still have the same number of load balancers applied on each zone's LB group, but that group will be applied to far fewer switches and routers.

Just for clarity: I'm _not_ saying we shouldn't fix this. However, we should also bear in mind that the deployment ovn-heater currently uses for load-balancer-dense scenarios is not the deployment model that ovn-kubernetes, the CMS we're testing for, moved to (*). We are in the process of changing the test methodology for ovn-heater to cover ovn-ic large-scale deployments too.

(*) I am aware that the way ovn-kubernetes is configured by some users still deploys a single large zone (no actual IC) but, AFAIK, the k8s services density those users have in that case is likely significantly lower.

>>
>>>>>
>>>>> I just realized that my earlier test had too many individual LBs, which hid this bottleneck. Now, with a small change to my script to use just a single LB group containing 1k LBs applied to 1k node-LS & GRs, I did reproduce the performance regression, with a >50% increase in both latency and memory during recompute. The ovn-heater test you ran may have more LB groups, which may have made it even worse.
>>>>>
>>>>
>>>> Unfortunately ovn-heater runs weekly on the latest version of the upstream branch, so it can catch patches that introduce regressions only after they get merged.
>>>>
>>>> That's due to the fact that we run ovn-heater downstream inside our lab and it's not that straightforward to automatically run it on each patch submission we get upstream.
>>>
>>> FWIW, the memory usage jump is present in density-heavy tests, and those were performed for this patch set for the stats in the cover letter. The week-to-week RSS jump for the density-heavy test is from 1.5 GB to 5-7 GB.
>>>
>>
>> That's true, we shouldn't have overlooked this.
>>
>>>>
>>>> Maybe it would be an idea to integrate some of Han's performance testing scripts into the set of tests we already have in the upstream repo, ovn-performance.at [0], and run those in GitHub actions too.
>>>>
>>>> [0] https://github.com/ovn-org/ovn/blob/main/tests/ovn-performance.at
>>>>
>>>> Han, others, what do you think?
>>>
>>> Define "performance", i.e. what exactly do we want to test in such a CI? I don't think we can test latency, for example, because all the public runners are very different and generally weak; there will be a lot of noise in the test results. I also don't think we have enough memory available to run things at high enough scale, even if they are isolated to a single set of OVN daemons. And low scale may not give any good indicators for performance issues.
>>>
>>
>> I was hoping to be able to detect at least some of the memory usage increase.
>>
>> A way to get more reliable results would be to use self-hosted runners, but that's not something we can do overnight and, similarly to ovn-heater, not easy to automatically run on every patch submission.
>>
>>>>
>>>>> Looking forward to the solution.
>>>>>
>>>>
>>>> Numan, please let me know if there's anything I can assist with.
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>
>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
