On Tue, Feb 6, 2024 at 3:52 PM Han Zhou <hz...@ovn.org> wrote:
>
> On Mon, Feb 5, 2024 at 7:47 PM Numan Siddique <num...@ovn.org> wrote:
> >
> > On Mon, Feb 5, 2024 at 9:41 PM Han Zhou <hz...@ovn.org> wrote:
> > >
> > > On Mon, Feb 5, 2024 at 4:12 PM Numan Siddique <num...@ovn.org> wrote:
> > > >
> > > > On Mon, Feb 5, 2024 at 5:54 PM Han Zhou <hz...@ovn.org> wrote:
> > > > >
> > > > > On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets <i.maxim...@ovn.org>
> > > wrote:
> > > > > >
> > > > > > On 2/5/24 15:45, Ilya Maximets wrote:
> > > > > > > On 2/5/24 11:34, Ilya Maximets wrote:
> > > > > > >> On 2/5/24 09:23, Dumitru Ceara wrote:
> > > > > > >>> On 2/5/24 08:13, Han Zhou wrote:
> > > > > > >>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique <num...@ovn.org
> >
> > > wrote:
> > > > > > >>>>>
> > > > > > >>>>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou <hz...@ovn.org>
> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <
> > > i.maxim...@ovn.org>
> > > > > wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>>  35 files changed, 9681 insertions(+), 4645
> deletions(-)
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I had another look at this series and acked the remaining
> > > > > > >>>>>>>>>> patches.  I just had some minor comments that can be easily
> > > > > > >>>>>>>>>> fixed when applying the patches to the main branch.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Thanks for all the work on this!  It was a very large
> > > > > > >>>>>>>>>> change, but it improves northd performance significantly.
> > > > > > >>>>>>>>>> I just hope we don't introduce too many bugs.  Hopefully
> > > > > > >>>>>>>>>> the time we have until release will allow us to further
> > > > > > >>>>>>>>>> test this change on the 24.03 branch.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Regards,
> > > > > > >>>>>>>>>> Dumitru
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Thanks a lot, Dumitru and Han, for the reviews and patience.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I addressed the comments and applied the patches to main and
> > > > > > >>>>>>>>> also to branch-24.03.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> @Han - I know you wanted to take another look into v6.  I
> > > > > > >>>>>>>>> didn't want to delay further as branch-24.03 was created.
> > > > > > >>>>>>>>> I'm more than happy to submit follow-up patches if you have
> > > > > > >>>>>>>>> any comments to address.  Please let me know.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Hi Numan,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I was writing a reply and saw your email just now.  Thanks a
> > > > > > >>>>>>>> lot for the huge effort to achieve this great optimization.
> > > > > > >>>>>>>> I only left one comment, on the implicit dependency left for
> > > > > > >>>>>>>> en_lrnat -> en_lflow.  Feel free to address it with a
> > > > > > >>>>>>>> follow-up; there is no need to block the branching.  And
> > > > > > >>>>>>>> take my Ack for the series with that addressed.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Acked-by: Han Zhou <hzhou at ovn.org>
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> Hi, Numan, Dumitru and Han.
> > > > > > >>>>>>>
> > > > > > >>>>>>> I see a huge negative performance impact, most likely from
> > > > > > >>>>>>> this set, on ovn-heater's cluster-density tests.  The memory
> > > > > > >>>>>>> consumption on northd
> > > > > > >>>
> > > > > > >>> Thanks for reporting this, Ilya!
> > > > > > >>>
> > > > > > >>>>>>> jumped about 4x, and it constantly recomputes due to failures
> > > > > > >>>>>>> of the port_group handler:
> > > > > > >>>>>>>
> > > > > > >>>>>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
> > > > > > >>>>>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
> > > > > > >>>>>>> ...
> > > > > > >>>>>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
> > > > > > >>>>>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
> > > > > > >>>>>>> ...
> > > > > > >>>>>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
> > > > > > >>>>>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
> > > > > > >>>>>>> ...
> > > > > > >>>>>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
> > > > > > >>>>>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)
> > > > > > >>>>>>>
> > > > > > >>>>>>> That increases the 95th percentile ovn-installed latency in
> > > > > > >>>>>>> the 500-node cluster-density test from 3.6 seconds last week
> > > > > > >>>>>>> to 21.5 seconds this week.
> > > > > > >>>>>>>
> > > > > > >>>>>>> I think this should be a release blocker.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Memory usage is also very concerning.  Unfortunately it is
> > > > > > >>>>>>> not tied to the cluster-density test.  The same 4-5x RSS jump
> > > > > > >>>>>>> is also seen in other tests like density-heavy.  Last week the
> > > > > > >>>>>>> RSS of ovn-northd in cluster-density 500-node was between 1.5
> > > > > > >>>>>>> and 2.5 GB; this week we have a range between 5.5 and 8.5 GB.
> > > > > > >>>>>>>
> > > > > > >>>>>>> I would consider this a release blocker as well.
> > > > > > >>>>>>>
> > > > > > >>>
> > > > > > >>> I agree, we shouldn't release 24.03.0 unless these two issues
> > > > > > >>> are (sufficiently) addressed.  We do have until March 1st (the
> > > > > > >>> official release date) to do that, or to revert any patches that
> > > > > > >>> cause regressions.
> > > > > > >>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> I don't have direct evidence that this particular series is
> > > > > > >>>>>>> the culprit, but it looks like the most likely candidate.  I
> > > > > > >>>>>>> can dig more into the investigation on Monday.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Best regards, Ilya Maximets.
> > > > > > >>>>>>
> > > > > > >>>>>> Thanks Ilya for reporting this. A 95% latency and 4x RSS
> > > > > > >>>>>> increase is a little surprising to me. I did test this series
> > > > > > >>>>>> with my scale test scripts for recompute performance
> > > > > > >>>>>> regression; it was a 10+% increase in latency. I even dug a
> > > > > > >>>>>> little into it, and noticed a ~5% increase caused by the hmap
> > > > > > >>>>>> used to maintain the lflows in each lflow_ref. This was
> > > > > > >>>>>> discussed in the code review for an earlier version (v2/v3).
> > > > > > >>>>>> Overall it didn't look very bad if we now handle the most
> > > > > > >>>>>> common scenarios incrementally, and it is reasonable to have
> > > > > > >>>>>> some cost for maintaining the references/index for incremental
> > > > > > >>>>>> processing. I wonder if my test scenario was too simple
> > > > > > >>>>>> (didn't have LBs included) to find the problems, so today I
> > > > > > >>>>>> did another test by including an LB group with 1k LBs applied
> > > > > > >>>>>> to 100 node-LS & GR, and another 1k LBs per node-LS & GR
> > > > > > >>>>>> (101k LBs in total). I did see more performance penalty, but
> > > > > > >>>>>> still within ~20%, while for memory I didn't notice a
> > > > > > >>>>>> significant increase (<10%). I believe I am missing some
> > > > > > >>>>>> specific scenario that had the big impact in the ovn-heater
> > > > > > >>>>>> tests. Please share if you dig out more clues.
> > > > > > >>>>>
> > > > > > >>>>> Hi Ilya,
> > > > > > >>>>>
> > > > > > >>>>> Thanks for reporting these details.
> > > > > > >>>>>
> > > > > > >>>>> I had a look at this regression.  There is a significant
> > > > > > >>>>> increase in the lflow recompute time (around 4x) in my local
> > > > > > >>>>> testing, and this is definitely not acceptable.
> > > > > > >>>>>
> > > > > > >>>>> In this particular cluster-density test, whenever a port group
> > > > > > >>>>> is created it results in a full recompute, and now that the
> > > > > > >>>>> recompute time has increased, it has a cumulative effect on the
> > > > > > >>>>> latency of the test.
> > > > > > >>>>>
> > > > > > >>>>> The dp reference counting [1] added in v4 of this series has
> > > > > > >>>>> introduced this regression (both CPU and memory).  I'm working
> > > > > > >>>>> on a fix, and I think I should be able to address this soon.
> > > > > > >>>>>
> > > > > > >>>>> [1] -
> > > > > https://github.com/ovn-org/ovn/blob/main/northd/lflow-mgr.c#L696
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>> Indeed, the same concern I mentioned when reviewing v5:
> > > > > > >>>>
> > > > >
> https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/411194.html
> > > > > > >>
> > > > > > >> Yeah.  In this cluster-density 500-node test, dp_refcnt_use()
> > > > > > >> allocates 120 M references and puts them into hash maps.  That
> > > > > > >> alone takes several GB of memory, and it is the largest consumer
> > > > > > >> according to massif: it accounts for ~45% of total memory
> > > > > > >> allocations.  Another ~25% is taken by the hash maps where the
> > > > > > >> refcounts are stored.
> > > > > > >>
> > > > > > >> There is a simple optimization that can be made: do not allocate
> > > > > > >> a refcount the first time we try to add one - the original bitmap
> > > > > > >> should be enough - and allocate only if we try to use it again
> > > > > > >> while it is already in the bitmap.  That saves a lot of space,
> > > > > > >> because 80% of all refcounts in that test are never used more
> > > > > > >> than once.  However, it might not be enough; needs testing.
> > > > > > >> Ideally the whole structure needs to go or be replaced by
> > > > > > >> something different.  Frequent iterations over the bitmaps and
> > > > > > >> per-datapath*per-lflow memory allocations do not scale.  Even if
> > > > > > >> we can make it work reasonably well for 500 nodes, it will cause
> > > > > > >> problems at higher scale.
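[A minimal standalone sketch of the optimization described above, assuming a fixed-size datapath bitmap and made-up names (the real OVN code in northd/lflow-mgr.c uses dynamic bitmaps and hmaps): the first reference to a datapath is tracked by the bitmap alone, and explicit counters are allocated only when a datapath is referenced a second time.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch only: fixed-size bitmap, hypothetical names. */
#define MAX_DPS 64

struct lflow_dp_refs {
    uint64_t bitmap;   /* Bit i set => datapath i references the lflow. */
    unsigned *extra;   /* Lazily allocated counts beyond the first use. */
};

static void
dp_ref_use(struct lflow_dp_refs *refs, int dp)
{
    uint64_t bit = UINT64_C(1) << dp;

    if (!(refs->bitmap & bit)) {
        /* First use: the bitmap alone is enough, no allocation. */
        refs->bitmap |= bit;
        return;
    }
    if (!refs->extra) {
        /* Second use of some datapath: now allocate real counters. */
        refs->extra = calloc(MAX_DPS, sizeof *refs->extra);
    }
    refs->extra[dp]++;
}

/* Returns true if the datapath is still referenced after the release. */
static bool
dp_ref_release(struct lflow_dp_refs *refs, int dp)
{
    if (refs->extra && refs->extra[dp]) {
        refs->extra[dp]--;
        return true;
    }
    refs->bitmap &= ~(UINT64_C(1) << dp);
    return false;
}
```

[Since ~80% of refcounts in the reported test are used only once, most lflows would never allocate the counter array at all under this scheme.]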
> > > > > > >
> > > > > > > FWIW, I have a small prototype for this partial solution that
> > > > > > > seems to noticeably decrease the memory usage here:
> > > > > > >
> > > > >
> > >
> https://github.com/igsilya/ovn/commit/5ac3a78e287708fd571344c74a021e418fde31a4
> > > > > > >
> > > > > > > I need to run this through the scale run to get actual results
> > > though.
> > > > > >
> > > > > > So, the results for the cluster-density 500 node test are here:
> > > > > >
> > > > > >           | Avg. Poll Intervals | Total test time | northd RSS
> > > > > > ----------+---------------------+-----------------+------------
> > > > > > Last week |     1.5 seconds     |  1005 seconds   |   2.5 GB
> > > > > > This week |      6  seconds     |  2246 seconds   |   8.5 GB
> > > > > > Patch     |     2.5 seconds     |  1170 seconds   |   3.1 GB
> > > > > >
> > > > > > So, the patch does help a lot.  But we still have a 66% increase
> > > > > > in the poll intervals on northd, i.e. a 66% increase in recompute
> > > > > > time, and about a 25% increase in memory usage.
> > > > > >
> > > > > > I know that Numan is working on an alternative solution to just
> > > > > > remove the reference counting altogether, so I'll hold on sending
> > > > > > an official patch.  Removing the refcounting, if possible, sounds
> > > > > > like a better approach in general.
> > > > > >
> > > > >
> > > > > Thanks Ilya. I also tested with your patch for my previous test case
> > > > > that had a 50% increase in latency and memory, and compared it with a
> > > > > patch that completely removes the ref count logic but keeps the hmap
> > > > > in lflow (without really using it after initializing).
> > > > > The difference is really small in my test (~1%) in both memory and
> > > > > latency, so I think your patch is quite effective. I believe the ref
> > > > > count will very rarely be larger than 1 in reality, so the memory
> > > > > cost should be negligible after applying your patch.
> > > > > I was actually hesitant about the approach, primarily concerning the
> > > > > latency, because the bitmap scan still happens, and for each set bit
> > > > > it will perform another bit test. However, in my test it seemed that
> > > > > the CPU cost was also negligible; I checked the perf flamegraph and
> > > > > the bitmap operations didn't show up as a significant part.
> > > > >
> > > > > So, your patch seems to me a good candidate solution for this
> > > > > particular problem. I am again surprised at the cluster-density
> > > > > 500-node test result you shared above. Could it be that the majority
> > > > > of the 66% increase in compute time and the 25% increase in memory is
> > > > > from a different part of the change, other than the ref count? Did
> > > > > you test with the ref count completely removed? (Of course, there is
> > > > > a correctness problem when the ref count would need to be > 1, but
> > > > > just ignore that for now for the performance evaluation.)
> > > > >
> > > > > It would be great if Numan comes up with an approach that avoids the
> > > > > ref count, and I am curious which direction it is. Initially I was of
> > > > > the opinion that for I-P we should avoid merging lflows' DP/DPGs, and
> > > > > instead directly create lflows for individual DPs or for DPGs
> > > > > generated beforehand (e.g. for LBs applied to multiple DPs), even if
> > > > > there may be some redundant flows generated. However, after reviewing
> > > > > Numan's patch it seemed more generic, and I think this more generic
> > > > > approach is OK if the performance cost is acceptable, although it
> > > > > does add more complexity.
> > > >
> > > > Let me just briefly explain why I had to add dp ref count in each
> > > ovn_lflow.
> > > >
> > > >   - When a logical flow L(M, A) is added more than once (i.e., via
> > > >      ovn_lflow_add()) for the same datapath, we need to keep this ref
> > > >      count so that we don't delete the logical flow from the SB DB when
> > > >      it is actually still required.  I noticed this behavior with this
> > > >      commit [1].
> > > >
> > > >  - There is also another possibility.  There can be two load
> > > >    balancers, lb1 and lb2, with the same VIP, and both LBs are
> > > >    associated with the same datapaths, e.g.:
> > > >      ovn-nbctl ls-lb-add sw0 lb1
> > > >      ovn-nbctl ls-lb-add sw0 lb2
> > > >    In this case, ovn-northd would generate the exact same logical
> > > >    flows for the common VIP.
> > > >    We need to have reference counting; otherwise, if lb2 is removed
> > > >    from sw0, we may delete the lflows from the SB DB whereas they are
> > > >    still needed for lb1.  This would get fixed when ovn-northd
> > > >    recomputes, but until then it's a problem.
> > > >    I added this test case here [2] specifically to cover this scenario
> > > (in v4)
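[The lb1/lb2 scenario above can be illustrated with a tiny hypothetical sketch; the names are made up, and the real per-(lflow, datapath) tracking lives in northd/lflow-mgr.c. The point is simply that the SB row must survive until the last reference is dropped.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an lflow shared by lb1 and lb2 on sw0. */
struct ovn_lflow_sketch {
    unsigned dp_refcnt; /* Times this (lflow, datapath) pair was added. */
    bool in_sb;         /* Whether the lflow row exists in the SB DB.   */
};

static void
lflow_add_for_dp(struct ovn_lflow_sketch *f)
{
    if (!f->dp_refcnt++) {
        f->in_sb = true;   /* First reference creates the SB row. */
    }
}

static void
lflow_del_for_dp(struct ovn_lflow_sketch *f)
{
    assert(f->dp_refcnt > 0);
    if (!--f->dp_refcnt) {
        f->in_sb = false;  /* Last reference removes the SB row. */
    }
}
```

[With the counter, removing lb2 decrements the reference but leaves the row in place for lb1; without it, the row would be deleted prematurely until the next full recompute.]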
> > >
> > > The motivation is clear. But what I mentioned was a different approach.
> > > In that approach, we don't care about adding the same lflow multiple
> > > times, provided that it happens rarely. When that happens, there will be
> > > multiple equivalent lflows, but each referenced by different objects.
> > > For lflows added with 'od', even if the 'od' is the same, they are added
> > > without checking if there is an old lflow. When the object referencing
> > > one of the lflows is deleted, the linked lflow is deleted accordingly,
> > > and the other 'copies' are still kept in the lflow table. Similarly, for
> > > lflows added with a DPG, we don't try to merge the DPGs and add them as
> > > is, even if the DPGs may contain the same DPs. There will be duplicated
> > > lflows in the SB (with different row uuids), but I believe
> > > ovn-controller can handle that properly.
> >
> > I think your approach is reasonable.
> >
> > I'd perhaps take this approach:
> >    - For the same flow added multiple times with the 'od' set, use the
> >      dp_refcnt.  This doesn't happen often, and even if it does, the cost
> >      of maintaining the dp_refcnt will be minimal.
> >
> >    - For flows added with a dp_group (mainly for lb_datapaths), as you
> >      suggested, use a DPG for each lb_datapaths.
> >
> > We can explore that and target for 24.09 perhaps.
> >
> Sounds good.
>
> >
> > >
> > > However, I am not concluding that the approach is better (or worse) than
> > > the current one. It may be more friendly to I-P handling, but having
> > > duplications (even in rare situations) is not ideal, so there are pros
> > > and cons. If the ref count performance problem is resolved, I think the
> > > current approach is very reasonable.
> > >
> > > >
> > > > I tried taking the approach of removing the dp ref count and falling
> > > > back to recompute when a logical flow's dp count (in the dp group
> > > > bitmap) doesn't match a refcount variable.  Here's the WIP patch:
> > > >
> > >
> https://github.com/numansiddique/ovn/commit/92f6563d9e7b1e6c8f2b924dea7a220781e10a05
> > > > I still need to run cluster density tests.
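[A rough standalone sketch of that WIP approach, with illustrative names and a fixed-size bitmap in place of OVN's dynamic dp-group bitmap: keep only the bitmap plus a simple add counter, and treat a popcount/counter mismatch as the signal to fall back to a full recompute.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the real patch works on dynamic dp-group bitmaps. */
struct lflow_sketch {
    uint64_t dpg_bitmap; /* Datapaths referencing this lflow.   */
    unsigned n_ods;      /* Incremented on every add of the flow. */
};

static void
lflow_add(struct lflow_sketch *f, int dp)
{
    f->dpg_bitmap |= UINT64_C(1) << dp;
    f->n_ods++;
}

/* Returns false when the bitmap population count disagrees with the add
 * counter, i.e. some datapath was added more than once and incremental
 * deletion would be unsafe; the caller then requests a full recompute. */
static bool
lflow_del_is_safe(const struct lflow_sketch *f)
{
    unsigned pop = 0;

    for (uint64_t b = f->dpg_bitmap; b; b &= b - 1) {
        pop++;
    }
    return pop == f->n_ods;
}
```

[This trades the per-datapath refcount storage for a rare recompute in the duplicate-add case, which matches the observation that duplicates should be uncommon in practice.]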
> > > >
> > > > IMO Ilya's approach seems cleaner to me.  What do you all think?
> > > > Once the tests are complete, I'll share the results.
> > > >
> > > > And the test results from both Ilya and Han have shown a positive
> > > > impact.
> > >
> > > I am OK with Ilya's patch. My tests showed that it solved the ref
> > > count's performance problem, but it is better to also confirm with the
> > > ovn-heater test, comparing against a patch that simply removes the ref
> > > count, in case my test coverage is not sufficient.
> > >
> >
> > I did some testing with my patch and these are the findings
> >
> >                      | Avg. Poll Intervals | Total test time | northd RSS
> > ---------------------+---------------------+-----------------+-----------
> > Last week            |     1.5 seconds     |  1005 seconds   |   2.5 GB
> > This week            |      6  seconds     |  2246 seconds   |   8.5 GB
> > Ilya's Patch         |     2.5 seconds     |  1170 seconds   |   3.1 GB
> > Numan's patch (run1) |     1.6 seconds     |   992 seconds   |   2.43 GB
> > Numan's patch (run2) |     1.7 seconds     |  1022 seconds   |   2.43 GB
> >
> > Seems like removing the dp ref cnt altogether is more efficient and very
> > close to last week's results.
> >
> > I'll submit the patch tomorrow after some more testing.
> >
> > Feel free to provide any review comments or feedback if you have any.
> >
> > Thanks
> > Numan
> >
> > > >
> > > > Regarding the cluster-density 500-node test with Ilya's fix, I think
> > > > this seems OK to me, since the test falls back to recompute a lot and
> > > > there is now some cost from the lflow I-P patches.  This can be
> > > > improved by adding port group I-P.  I can start looking into the port
> > > > group I-P if you think it would be worth it.  Let me know.
> > > >
> > >
> > > Adding more I-P would be helpful, but it still doesn't explain the
> > > observation from Ilya regarding the 66% increase in recompute time and
> > > the 25% increase in memory. My test cases couldn't reproduce such
> > > significant increases (it is ~10-20% in my tests). I think we'd better
> > > figure out what caused such big increases - is it still related to the
> > > ref count, or is it something else? I think the lflow_ref might have
> > > contributed to it, especially the use of an hmap instead of a list. It
> > > is totally fine if the lflow_ref adds some cost, which is required for
> > > the I-P to work, but it is better to understand where the cost comes
> > > from and justify it (or optimize it).
> >
> > Looks like from my patch results, dp_ref_cnt is contributing to this
> > increase.  Is it possible for you to trigger another test with my patch:
> >
> https://github.com/numansiddique/ovn/commit/92f6563d9e7b1e6c8f2b924dea7a220781e10a05
> >
> It is great to see that removing ref count resolved the performance gap.
> The test result is actually very interesting. I also had a test with more
> LBs (10k instead of the earlier 1k) in the large LBG, trying to make the
> cost of bitmap scan more obvious, but still, the difference between your
> patch and Ilya's patch is not significant in my test. The latency
> difference is within 5%, and the memory <1%. So I wonder why in the
> ovn-heater test it is so different. Is it possible that in the ovn-heater
> test it does create a lot of duplicated lflows, so ref count hmap operations
> are actually triggered a significant number of times, which might have
> contributed to the major cost even with Ilya's patch, and with your patch
> it means it falls back to recompute much more often?
>

With the cluster-density ovn-heater tests, since we don't have I-P for
port groups, they fall back to a full recompute whenever port groups are
created, so there are a lot more lflow recomputes in this test.  We saw
this regression once the I-P patches were merged because of the cost of
dp_refcnt_use().

Your point is valid.  We should see whether the ovn-heater tests create a
lot of duplicate flows or not.  I'll check it out.



> Your patch looks good to me at a glance. I'd just add a coverage counter to
> count how many times the engine falls back to recompute because of the
> refcnt mismatch (where the warning log is printed).

Ack.  I'll add the coverage counter.
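[In OVS/OVN code such a counter would normally use the COVERAGE_DEFINE()/COVERAGE_INC() macros from lib/coverage.h; below is a standalone stand-in for the pattern, with hypothetical names and a boolean in place of the real mismatch check.]

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for COVERAGE_DEFINE()/COVERAGE_INC(); real ovn-northd code
 * would use lib/coverage.h, which also reports counts via
 * "ovs-appctl coverage/show". */
static unsigned long long cov_lflow_refcnt_mismatch;
#define COVERAGE_INC_SKETCH(counter) ((counter)++)

/* Hypothetical handler path: count the fallback at the point where the
 * refcnt-mismatch warning is logged, then signal "recompute needed". */
static int
port_group_handler_sketch(bool refcnt_mismatch)
{
    if (refcnt_mismatch) {
        COVERAGE_INC_SKETCH(cov_lflow_refcnt_mismatch);
        return 0;  /* Handler failed => engine falls back to recompute. */
    }
    return 1;      /* Handled incrementally. */
}
```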


> If we want to understand it more clearly, we can add a coverage counter to
> Ilya's patch where the ref count hmap insertion actually happens (when
> dp_refcnt_use() is called), so that we can tell how often it happens during
> the ovn-heater test.  If the ovn-heater test really results in lots of
> duplicated lflows, we might dig further into whether the test scenario is
> realistic in real-world deployments, and may need to figure out how to
> avoid it (or modify the test if it is non-realistic).
>
> I'll have a more careful review of the patch when you post it formally.
Thanks

Numan

>
> Thanks,
> Han
>
> > Thanks
> > Numan
> >
> > >
> > > Thanks,
> > > Han
> > >
> > > >
> > > >
> > > > [1] -
> > >
> https://github.com/ovn-org/ovn/commit/fe1c5df98b6f38a59dc5bdbba658a25effe32667
> > > > [2] -
> https://github.com/ovn-org/ovn/blob/main/tests/ovn-northd.at#L11507
> > > >
> > > > Thanks
> > > > Numan
> > > >
> > > > >
> > > > > For Dumitru's proposal:
> > > > > >> Maybe it would be an idea to integrate some of Han's performance
> > > > > >> testing scripts into the set of tests we already have in the
> > > > > >> upstream repo, ovn-performance.at [0], and run those in GitHub
> > > > > >> Actions too.
> > > > >
> > > > > >> [0] https://github.com/ovn-org/ovn/blob/main/tests/ovn-performance.at
> > > > >
> > > > > >> Han, others, what do you think?
> > > > >
> > > > > I've been thinking about this for a long time, but it wasn't
> > > > > prioritized. It was not easy for the reasons mentioned by Ilya, but
> > > > > I agree it is something we need to figure out. For the memory
> > > > > increase it should be pretty reliable, but for CPU/latency we might
> > > > > need to think about metrics that take into
> > > > > account the performance of the node that executes the test cases.
> > > > > Of course, hardware resources (CPU and memory of the test node)
> > > > > vs. the scale we can test is another concern, but we may start with
> > > > > a scale that is not too big to run in GitHub Actions while big
> > > > > enough to provide good performance metrics.
> > > > >
> > > > > Thanks,
> > > > > Han
> > > > >
> > > > > > Best regards, Ilya Maximets.
> > > > > _______________________________________________
> > > > > dev mailing list
> > > > > d...@openvswitch.org
> > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
