Re: [ovs-dev] [PATCH ovn] northd: Allow delay of northd engine runs

Han Zhou Mon, 14 Aug 2023 10:05:16 -0700

On Mon, Aug 14, 2023 at 9:54 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 8/11/23 15:07, Dumitru Ceara wrote:
> > On 8/10/23 18:38, Ilya Maximets wrote:
> >> On 8/10/23 17:34, Dumitru Ceara wrote:
> >>> On 8/10/23 17:20, Han Zhou wrote:
> >>>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com>
wrote:
> >>>>>
> >>>>> On 8/10/23 15:34, Han Zhou wrote:
> >>>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com>
wrote:
> >>>>>>>
> >>>>>>> On 8/10/23 08:12, Ales Musil wrote:
> >>>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <
mmich...@redhat.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Ales,
> >>>>>>>>>
> >>>>>>>>> I have some high-level comments/questions about this patch.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Mark,
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi Ales, Mark,
> >>>>>>>
> >>>>>>>> thank you for the review. See my answers inline below.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> I have been privy to the conversations that led to this change.
My
> >>>>>>>>> understanding is that by having ovn-northd wake up immediately,
it
> >>>> can
> >>>>>>>>> be more CPU-intensive than waiting a bit for changes to
accumulate
> >>>> and
> >>>>>>>>> handling all of those at once instead. However, nothing in
either the
> >>>>>>>>> commit message or ovn-nb.xml explains what the purpose of this
new
> >>>>>>>>> configuration option is. I think you should add a sentence or
two to
> >>>>>>>>> explain why someone would want to enable this option.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> Yeah that's my bad, I have v2 prepared with some explanation in
the
> >>>>>> commit
> >>>>>>>> message
> >>>>>>>> together with results from scale run.
> >>>>>>>>
> >>>>>>>
> >>>>>>> +1 we really need to explain why this change is needed.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Next, the algorithm used here strikes me as odd. We use the
previous
> >>>>>> run
> >>>>>>>>> time of ovn-northd to determine how long to wait before running
> >>>> again.
> >>>>>>>>> This delay is capped by the configured backoff time. Let's say
that
> >>>>>>>>> we've configured the backoff interval to be 200 ms. If
ovn-northd
> >>>> has a
> >>>>>>>>> super quick run and only takes 10 ms, then we will only delay
the
> >>>> next
> >>>>>>>>> run by 10 ms. IMO, this seems like it would not mitigate the
original
> >>>>>>>>> issue by much, since we are only allowing a maximum of 20 ms
(10 ms
> >>>> for
> >>>>>>>>> the run of ovn-northd + 10 ms delay) of NB changes to
accumulate.
> >>>>>>>>> Conversely, if northd has a huge recompute and it takes 5000 ms
to
> >>>>>>>>> complete, then we would delay the next run by 200ms. In this
case,
> >>>>>>>>> delaying at all seems like it's not necessary since we
potentially
> >>>> have
> >>>>>>>>> 5000 ms worth of NB DB updates that have not been addressed. I
would
> >>>>>>>>> have expected the opposite approach to be taken. If someone
> >>>> configures
> >>>>>>>>> 200ms as their backoff interval, I would expect us to always
allow a
> >>>>>>>>> *minimum* of 200ms of NB changes to accumulate before running
again.
> >>>> So
> >>>>>>>>> for instance, if northd runs quickly and is done in 10 ms, then
we
> >>>>>> would
> >>>>>>>>> wait 200 - 10 = 190 ms before processing changes again. If
northd
> >>>> takes
> >>>>>>>>> a long time to recompute and takes 5000 ms, then we would not
wait at
> >>>>>>>>> all before processing changes again. Was the algorithm chosen
based
> >>>> on
> >>>>>>>>> experimentation? Is it a well-known method I'm just not familiar
> >>>> with?
> >>>>>>>
> >>>>>>> I think the main assumption (that should probably be made
explicit in
> >>>>>>> the commit log and/or documentation) is that on average changes
happen
> >>>>>>> in a uniform way.  This might not always be accurate.
> >>>>>>>
> >>>>>>> However, if we're off with the estimate, in the worst case we'd be
> >>>>>>> adding the configured max delay to the latency of processing
changes.
> >>>>>>> So, as long as the value is not extremely high, the impact is not
that
> >>>>>>> high either.
> >>>>>>>
> >>>>>>> Last but not least, as this value would be configured by the CMS,
we
> >>>>>>> assume the CMS knows what they're doing. :)
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I'm not sure if the algorithm is well known.
> >>>>>>>>
> >>>>>>>> The thing is that at scale we almost always cap at the backoff
so it
> >>>> has
> >>>>>>>> probably
> >>>>>>>> the same effect as your suggestion with the difference that we
> >>>> actually
> >>>>>>>> delay even
> >>>>>>>> after long runs. And that is actually desired, it's true that in
the
> >>>>>> let's
> >>>>>>>> say 500 ms
> >>>>>>>> should be enough to accumulate more changes however that can
lead to
> >>>>>> another
> >>>>>>>> 500 ms run and so on. That in the end means that northd will
spin at
> >>>>>> 100%
> >>>>>>>> CPU
> >>>>>>>> anyway which is what we want to avoid. So from another point of
view
> >>>> the
> >>>>>>>> accumulation
> >>>>>>>> of IDL changes is a secondary effect which is still desired, but
not
> >>>> the
> >>>>>>>> main purpose.
> >>>>>>>>
> >>>>>>>> Also delaying by short time if the previous run was short is
fine, you
> >>>>>> are
> >>>>>>>> right that we don't
> >>>>>>>> accumulate enough however during short running times there is a
high
> >>>>>> chance
> >>>>>>>> that the
> >>>>>>>> northd would get to sleep anyway (We will help it to sleep at
least a
> >>>>>> bit
> >>>>>>>> nevertheless).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but
> >>>>>> haven't
> >>>>>>>>> changed the ovsdb_idl_loop_run() or
ovsdb_idl_loop_commit_and_wait()
> >>>>>>>>> calls. Is there any danger of ovn-northd getting in a busy loop
of
> >>>>>>>>> sleeping and waking because of this? I don't think it should,
since
> >>>>>>>>> presumably ovsdb_idl_loop_run() should clear the conditions
waited on
> >>>>>> by
> >>>>>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will
> >>>> process
> >>>>>>>> anything
> >>>>>>>> that it can and wait will be fine. The problem would be if we
would
> >>>>>> skip the
> >>>>>>>> ovsdb_idl_loop_run() for some reason.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Next, does this have any negative impact on our ability to
perform
> >>>>>>>>> incremental processing in ovn-northd? My concern is that since
we are
> >>>>>>>>> still running the ovsdb IDL loop that if multiple NB changes
occur
> >>>>>>>>> during our delay, then we might have to fall back to a full
recompute
> >>>>>>>>> instead of being able to incrementally process the changes. Are
my
> >>>>>>>>> concerns valid?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I suppose that can happen if there are changes that could result
in
> >>>>>>>> "conflict"
> >>>>>>>> and full recompute. During the test we haven't seen anything like
> >>>> that.
> >>>>>>>> The odds of that happening are small because as stated
previously we
> >>>> are
> >>>>>>>> doing
> >>>>>>>> basically the same as if the engine was running for a long time
always
> >>>>>> from
> >>>>>>>> the IDL
> >>>>>>>> point of view except that we give IDL a chance to process
whatever has
> >>>>>>>> pilled up
> >>>>>>>> within the sleep period.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Next, has scale testing shown that this change has made a
positive
> >>>>>>>>> impact? If so, is there any recommendation for how to determine
what
> >>>> to
> >>>>>>>>> configure the value to?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> It has a huge impact actually the value tested was 200 ms, the
> >>>>>>>> recommendation
> >>>>>>>
> >>>>>>> This was chosen based on the historical data from similar tests
which
> >>>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale.
> >>>>>>>
> >>>>>>>> would be < 500 ms. After that point the latency on components
creation
> >>>>>>>> would be
> >>>>>>>> very noticable. I will put the recommendation into the
ovn-nb.xml with
> >>>>>> the
> >>>>>>>> latency
> >>>>>>>> comment. Before I'll post v2 (which has the numbers in commit
message)
> >>>>>> those
> >>>>>>>> are the test results:
> >>>>>>>>
> >>>>>>>> Run without any backoff period:
> >>>>>>>> northd aggregate CPU 9810% avg / 12765% max
> >>>>>>>> northd was spinning at 100% CPU the entire second half of the
test.
> >>>>>>>>
> >>>>>>>> Run with 200 ms max backoff period:
> >>>>>>>> northd aggregate CPU 6066% avg / 7689% max
> >>>>>>>> northd was around 60% for the second half of the test
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Finally, is this change expected to be a long-term necessity?
This
> >>>>>>>>> option seems to be useful for cases where northd recomputes are
> >>>>>>>>> required. Performing recomputes less frequently seems like it
would
> >>>>>>>>> lower the CPU usage of ovn-northd while still processing the
same
> >>>>>> amount
> >>>>>>>>> of changes. However, once northd can handle most changes
> >>>> incrementally,
> >>>>>>>>> is there still a benefit to delaying running? If each run of
northd
> >>>>>>>>> handles all DB changes incrementally, then is there any point in
> >>>>>> putting
> >>>>>>>>> delays between those incremental runs?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> Ideally we won't need it in the future. However, the assumption
for
> >>>> not
> >>>>>>>> needing
> >>>>>>>> anything like this is that northd will be fast enough to process
I-P
> >>>>>>>> changes and
> >>>>>>>> be able to sleep between the next batch update arrives from CMS.
That
> >>>>>>>> doesn't
> >>>>>>>> seem to happen in very near future, one thing to keep in mind is
that
> >>>>>>>> testing
> >>>>>>>> happened with Numan's I-P for LBs and lflows which make a huge
> >>>>>> difference,
> >>>>>>>> but
> >>>>>>>> still not enough to achieve the mentioned northd state. So from
my
> >>>>>>>> perspective
> >>>>>>>> it will be relevant for a few releases. And as stated above the
point
> >>>>>> is to
> >>>>>>>> prevent
> >>>>>>>> northd to spin at 100% CPU all the time.
> >>>>>>>>
> >>>>>>>
> >>>>>>> +1 it's not the prettiest feature (and some might rightfully call
it a
> >>>>>>> hack) but it seems to me like the cleanest alternative for now,
until
> >>>>>>> northd processing is fully incremental.
> >>>>>>
> >>>>>> In most cases it may be fine, but it might be a problem for a
worst case
> >>>>>> scenario:
> >>>>>>
> >>>>>> Assume all the changes coming in NB can be incrementally processed
but
> >>>> at
> >>>>>> a very very high rate, and ovn-northd keeps processing the changes
> >>>>>> incrementally. Since the change rate is so high, ovn-northd barely
> >>>> keeps up
> >>>>>> with the changes with 99% CPU load. For example, I-P for each
object
> >>>> takes
> >>>>>> 10ms, and the change rate is 99 objects/sec. According to this
> >>>> algorithm,
> >>>>>> ovn-northd will always sleep for the maximum 200ms between each
IDL run,
> >>>>>> and then ovn-northd would never keep up with the changes any more
- the
> >>>>>> backlog will become longer and longer because of the wasted idle
time.
> >>>>>>
> >>>>>
> >>>>> IDL runs are not skipped.  Just I-P engine runs.  So I think this
won't
> >>>>> be a problem, or am I missing something?
> >>>>
> >>>> Sorry about the typo, I meant to say between each "engine run"
instead of
> >>>> "IDL run". IDL run is not skipped, but the backlog (accumulated
changes in
> >>>> IDL) becomes longer and longer. E.g.:
> >>>>
> >>>> (assume change rate is 100 object/sec)
> >>>>
> >>>> run-1: handles 1 object, takes 10ms, sleep 10ms
> >>>> run-2: handles 2 objects, takes 20ms, sleep 20ms
> >>>> run-3: handles 4 objects, takes 40ms, sleep 40ms
> >>>> run-4: handles 8 objects, takes 80ms, sleep 80ms
> >>>> run-5: handles 16 objects, takes 160ms, sleep 160ms
> >>>> run-6: handles 32 objects, takes 320ms, sleep 200ms
> >>>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes
520ms, sleep
> >>>> 200ms
> >>>> run-8: handles 72 objects, takes 720ms, sleep 200ms
> >>>> run-9: handles 92 objects, takes 920ms, sleep 200ms
> >>>> ...
> >>>> As we can see the backlog grows indefinitely if the input keeps
changing at
> >>>> the rate of 100 obj/s.
> >>>>
> >>>
> >>> I see now, thanks for the explanation.  But isn't this possible today
> >>> too?  Maybe less probable though.
> >>>
> >>> Also, we'll probably hit other issues (e.g., timeouts on CMS side)
> >>> because of the backlog which will (likely?) throttle the CMS.
> >>>
> >>> What would be a good solution?  It seems hard to define what a "large
> >>> backlog" is.  If we could do that, maybe triggering a full recompute
> >>> when the backlog is large enough might be OK.
> >>
> >> Hmm.  That's an interesting problem.
> >> From my understanding, the cost of incremental processing run consists
> >> of two parts: some more or less constant engine run cost E and a cost
> >> of processing actual changes P.  Unless CMS is changing the same value
> >> over and over again, we can't optimize P by accumulating more changes
> >> per run.  But we can amortize the more or less constant cost E.
> >>
> >> We had a similar problem with the full recompute in the past.  If we
> >> assume the cost of a recompute R to be more or less constant regardless
> >> of amount of changes, then we can reduce the total cost by aggregating
> >> more changes per run.  That's what run_idl_loop() function is trying
> >> to do.  It runs IDL for as long as there are changes (capped at 500ms).
> >>
> >> We can't really avoid paying the cost P and it will compound with the
> >> increased number of changes.  The only solution here, AFAICT, is to
> >> fall back to full recompute once P > R.  Maybe that can be done by
> >> somehow calculating the number of tracked vs real rows in IDL.
> >> Current ovn-northd has this problem and introduction of delays will
> >> amplify it.
> >>
> >
> > That is still hard to do unfortunately.  The ratio of IDL tracked vs
> > "real" rows is still not a good enough indication.  It's possible that
> > northd can quickly incrementally process row updates for table T1 but
> > that it has to do intensive computations for row updates for table T2.
> >
> > A simpler (dumb?) approach is to just periodically recompute.  But that
> > shouldn't happen too often because then we start using a lot of CPU
> > again. :)
>
>
> Thinking a bit more about this, if ovn-northd is processing updates
> slower than they are arriving, the database server will eventually
> notice a backlog on the connection and will start combining multiple
> updates into larger ones.  In this case northd will have some chances
> to catch up.


Not really. If ovn-northd processes updates slower than they are arriving,
and if the processing is pure I-P (the processing time is O(N), N = number
of items changed), I think it will never catch up.
In other words, ovn-northd would have a max throughput with I-P. Sleeping
unconditionally would decrease the throughput.

> However, I think, northd should not sleep under these
> circumstances.  One way to detect this situation can be checking the
> time spent in idl run.  We already do that as we limit the idl_run
> calls with 500ms maximum duration.  So, if we bail from the loop due
> to duration > 500ms, northd should not sleep and get back to polling
> as soon as it finished with previous updates.  In other cases it's
> probably fine to sleep a little, since database updates are not
> backlogged, i.e. ovn-northd is able to process them in more or less
> reasonable time.
>

Yes, using the last loop duration as an indication to avoid sleeping may
avoid the performance degradation in this case.

Thanks,
Han

> Thoughts?
>
> Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH ovn] northd: Allow delay of northd engine runs

Reply via email to