On 2/5/24 12:10, Ilya Maximets wrote:
> On 2/5/24 11:58, Dumitru Ceara wrote:
>> On 2/5/24 11:34, Ilya Maximets wrote:
>>> On 2/5/24 09:23, Dumitru Ceara wrote:
>>>> On 2/5/24 08:13, Han Zhou wrote:
>>>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique <[email protected]> wrote:
>>>>>>
>>>>>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou <[email protected]> wrote:
>>>>>>>
>>>>>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> I had another look at this series and acked the remaining
>>>>>>>>>>> patches.  I just had some minor comments that can be easily
>>>>>>>>>>> fixed when applying the patches to the main branch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for all the work on this!  It was a very large change
>>>>>>>>>>> but it improves northd performance significantly.  I just hope
>>>>>>>>>>> we don't introduce too many bugs.  Hopefully the time we have
>>>>>>>>>>> until release will allow us to further test this change on the
>>>>>>>>>>> 24.03 branch.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dumitru
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks a lot Dumitru and Han for the reviews and patience.
>>>>>>>>>>
>>>>>>>>>> I addressed the comments and applied the patches to main and
>>>>>>>>>> also to branch-24.03.
>>>>>>>>>>
>>>>>>>>>> @Han - I know you wanted to take another look into v6.  I didn't
>>>>>>>>>> want to delay further as branch-24.03 was created.  I'm more than
>>>>>>>>>> happy to submit follow-up patches if you have any comments to
>>>>>>>>>> address.  Please let me know.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Numan,
>>>>>>>>>
>>>>>>>>> I was writing the reply and saw your email just now.  Thanks a
>>>>>>>>> lot for the huge effort to achieve this great optimization.  I
>>>>>>>>> only left one comment, on the implicit en_lrnat -> en_lflow
>>>>>>>>> dependency.  Feel free to address it with a follow-up; no need to
>>>>>>>>> block the branching.  And take my Ack for the series with that
>>>>>>>>> addressed.
>>>>>>>>>
>>>>>>>>> Acked-by: Han Zhou <hzhou at ovn.org>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi, Numan, Dumitru and Han.
>>>>>>>>
>>>>>>>> I see a huge negative performance impact, most likely from this
>>>>>>>> set, on ovn-heater's cluster-density tests.  The memory
>>>>>>>> consumption on northd
>>>>
>>>> Thanks for reporting this, Ilya!
>>>>
>>>>>>>> jumped about 4x and it constantly recomputes due to failures of
>>>>>>>> the port_group handler:
>>>>>>>>
>>>>>>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
>>>>>>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
>>>>>>>> ...
>>>>>>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
>>>>>>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
>>>>>>>> ...
>>>>>>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
>>>>>>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
>>>>>>>> ...
>>>>>>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
>>>>>>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)
>>>>>>>>
>>>>>>>> That increases the 95th percentile ovn-installed latency in the
>>>>>>>> 500-node cluster-density test from 3.6 seconds last week to 21.5
>>>>>>>> seconds this week.
>>>>>>>>
>>>>>>>> I think this should be a release blocker.
>>>>>>>>
>>>>>>>> Memory usage is also very concerning.  Unfortunately, it is not
>>>>>>>> tied to the cluster-density test.  The same 4-5x RSS jump is also
>>>>>>>> seen in other tests like density-heavy.  Last week the RSS of
>>>>>>>> ovn-northd in 500-node cluster-density was between 1.5 and 2.5 GB;
>>>>>>>> this week we have a range between 5.5 and 8.5 GB.
>>>>>>>>
>>>>>>>> I would consider this a release blocker as well.
>>>>>>>>
>>>>
>>>> I agree, we shouldn't release 24.03.0 unless these two issues are
>>>> (sufficiently) addressed.  We do have until March 1st (official release
>>>> date) to do that or to revert any patches that cause regressions.
>>>>
>>>>>>>>
>>>>>>>> I don't have direct evidence that this particular series is the
>>>>>>>> culprit, but it looks like the most likely candidate.  I can dig
>>>>>>>> more into the investigation on Monday.
>>>>>>>>
>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>
>>>>>>> Thanks Ilya for reporting this.  The 95th percentile latency and 4x
>>>>>>> RSS increases are a little surprising to me.  I did test this series
>>>>>>> with my scale test scripts for recompute performance regression; it
>>>>>>> showed a 10+% increase in latency.  I even dug a little into it, and
>>>>>>> noticed a ~5% increase caused by the hmap used to maintain the
>>>>>>> lflows in each lflow_ref.  This was discussed in the code review for
>>>>>>> an earlier version (v2/v3).  Overall it did not look very bad, given
>>>>>>> that we now handle the most common scenarios incrementally, and it
>>>>>>> is reasonable to have some cost for maintaining the references/index
>>>>>>> for incremental processing.  I wonder if my test scenario was too
>>>>>>> simple (it didn't have LBs included) to find the problems, so today
>>>>>>> I did another test by including an LB group with 1k LBs applied to
>>>>>>> 100 node-LS & GRs, and another 1K LBs per node-LS & GR (101K LBs in
>>>>>>> total).  I did see more performance penalty, but still within ~20%,
>>>>>>> while for memory I didn't notice a significant increase (<10%).  I
>>>>>>> believe I am missing some specific scenario that had the big impact
>>>>>>> in the ovn-heater tests.  Please share if you dig out more clues.
>>>>>>
>>>>>> Hi Ilya,
>>>>>>
>>>>>> Thanks for reporting these details.
>>>>>>
>>>>>> I had a look at this regression.  There is a significant increase in
>>>>>> the lflow recompute time (around 4x) in my local testing, and this is
>>>>>> definitely not acceptable.
>>>>>>
>>>>>> In this particular cluster-density test, whenever a port group is
>>>>>> created it results in a full recompute, and now that the recompute
>>>>>> time has increased, it has a cumulative effect on the latency of the
>>>>>> test.
>>>>>>
>>>>>> The dp reference counting [1] added in v4 of this series has
>>>>>> introduced this regression (both CPU and memory).  I'm working on the
>>>>>> fix and I think I should be able to address this soon.
>>>>>>
>>>>>> [1] - https://github.com/ovn-org/ovn/blob/main/northd/lflow-mgr.c#L696
>>>>>>
>>>>>>
>>>>> Indeed, this is the same concern I mentioned when reviewing v5:
>>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/411194.html
>>>
>>> Yeah.  In this cluster-density 500-node test, dp_refcnt_use() allocates
>>> 120M references and puts them into hash maps.  That alone takes several
>>> GB of memory, and it is the largest consumer according to massif: it
>>> accounts for ~45% of total memory allocations.  Another ~25% is taken
>>> by the hash maps where the refcounts are stored.
>>>
>>> There is a simple optimization that can be made: do not allocate a
>>> refcount the first time we try to add one.  The original bitmap should
>>> be enough.  Only if we try to use it again, while it is already set in
>>> the bitmap, do we allocate.  That saves a lot of space, because 80% of
>>> all refcounts in that test are never used more than once.  However, it
>>> might not be enough; it needs testing.  Ideally, the whole structure
>>> needs to go or be replaced by something different.  Frequent iterations
>>> over the bitmaps and per-datapath*per-lflow memory allocations do not
>>> scale.  Even if we can make it work reasonably well for 500 nodes, it
>>> will cause problems at higher scale.
>>>
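To make the idea above concrete, here is a minimal sketch of that
lazy-allocation approach.  It is illustration only: it borrows the
dp_refcnt_use() name mentioned above, but the struct layout, the per-lflow
'dp_refcnts' hmap, and the 'dpref_bitmap' parameter are assumptions and may
not match the actual lflow-mgr.c code.  The first use of a datapath only
sets the existing bitmap bit; an explicit refcount node is allocated only
when the bit is already set.

#include "bitmap.h"
#include "hash.h"
#include "openvswitch/hmap.h"
#include "util.h"

/* Hypothetical refcount node; allocated only for datapaths that are
 * referenced more than once by the same lflow. */
struct dp_refcnt {
    struct hmap_node key_node;  /* In a per-lflow 'dp_refcnts' map. */
    size_t dp_index;            /* Datapath index, also the hash key. */
    size_t refcnt;              /* Extra uses beyond the bitmap bit. */
};

static void
dp_refcnt_use(struct hmap *dp_refcnts, unsigned long *dpref_bitmap,
              size_t dp_index)
{
    if (!bitmap_is_set(dpref_bitmap, dp_index)) {
        /* First reference: the bitmap bit alone is enough, no allocation
         * needed.  In the test above, ~80% of refcounts would stop here. */
        bitmap_set1(dpref_bitmap, dp_index);
        return;
    }

    /* Second or later reference: look up (or create) an explicit
     * refcount entry. */
    uint32_t hash = hash_int(dp_index, 0);
    struct dp_refcnt *r;

    HMAP_FOR_EACH_WITH_HASH (r, key_node, hash, dp_refcnts) {
        if (r->dp_index == dp_index) {
            r->refcnt++;
            return;
        }
    }

    r = xzalloc(sizeof *r);
    r->dp_index = dp_index;
    r->refcnt = 1;
    hmap_insert(dp_refcnts, &r->key_node, hash);
}

On the release path, the matching logic would only clear the bitmap bit
once any extra refcount entry for that datapath has dropped back to zero
and been freed.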
>>
>> Isn't this in a way related to the question "does it make sense to fall
>> back to a form of recompute at a certain scale"?
> 
> I'm not sure that question is relevant here.  We do want I-P to work at
> high scale.  So far the recompute is slow, so we want to avoid it, and we
> are also making it slower by adding more I-P-related code.
> 
> And this only applies to CPU usage.  The memory usage is more or less the
> same regardless of whether the computations are done incrementally or not.
> 

On a slightly related note, these test scenarios are very
ovn-kubernetes-specific, and ovn-kubernetes moved to using a flavor of
ovn-ic to address (other) scalability issues in deployments with lots of
k8s services (lots of load balancers applied on an LB group attached to
each node switch/router).  With ovn-ic they'll still have the same number
of load balancers applied on each zone's LB group, but that group will be
applied to far fewer switches and routers.

Just for clarity: I'm _not_ saying we shouldn't fix this.  However, we
should also bear in mind that the deployment ovn-heater currently uses for
load-balancer-dense scenarios is not the deployment model that
ovn-kubernetes, the CMS we're testing for, moved to (*).

We are in the process of changing the test methodology for ovn-heater to
cover ovn-ic large scale deployments too.

(*) I am aware that the way ovn-kubernetes is configured by some users
still deploys a single large zone (no actual IC) but, AFAIK, the k8s
services density those users have in that case is likely significantly
lower.

>>
>>>>>
>>>>> I just realized that my earlier test had too many individual LBs,
>>>>> which hid this bottleneck.  Now, with a small change to my script to
>>>>> use just a single LB group containing 1k LBs applied to 1k node-LS &
>>>>> GRs, I did reproduce the performance regression, with a >50% increase
>>>>> in both latency and memory during recompute.  The ovn-heater test you
>>>>> ran may have more LB groups, which may have made it even worse.
>>>>>
>>>>
>>>>
>>>> Unfortunately, ovn-heater runs weekly on the latest version of the
>>>> upstream branch, so it can only catch patches that introduce
>>>> regressions after they get merged.
>>>>
>>>> That's due to the fact that we run ovn-heater downstream inside our lab
>>>> and it's not that straightforward to automatically run it on each patch
>>>> submission we get upstream.
>>>
>>> FWIW, the memory usage jump is present in density-heavy tests, and
>>> those were performed for this patch set for the stats in the cover
>>> letter.  The week-to-week RSS jump for the density-heavy test is from
>>> 1.5 GB to 5-7 GB.
>>>
>>
>> That's true, we shouldn't have overlooked this.
>>
>>>>
>>>> Maybe it would be an idea to integrate some of Han's performance testing
>>>> scripts into the set of tests we already have in the upstream repo,
>>>> ovn-performance.at [0], and run those in GitHub actions too.
>>>>
>>>> [0] https://github.com/ovn-org/ovn/blob/main/tests/ovn-performance.at
>>>>
>>>> Han, others, what do you think?
>>>
>>> Define "performance", i.e., what exactly do we want to test in such a
>>> CI?  I don't think we can test latency, for example, because all the
>>> public runners are very different and generally weak.  There will be a
>>> lot of noise in the test results.  I also don't think we have enough
>>> memory available to run things at a high enough scale, even if they are
>>> isolated to a single set of OVN daemons.  And a low scale may not give
>>> any good indicators for performance issues.
>>>
>>
>> I was hoping to be able to detect at least some of the memory usage
>> increase.
>>
>> A way to get more reliable results would be to use self-hosted runners
>> but that's not something we can do overnight and, similarly to
>> ovn-heater, not easy to automatically run on every patch submission.
>>
>>>>
>>>>> Looking forward to the solution.
>>>>>
>>>>
>>>> Numan, please let me know if there's anything I can assist with.
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>
>>
> 

