On 8/14/23 19:04, Han Zhou wrote:
> On Mon, Aug 14, 2023 at 9:54 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 8/11/23 15:07, Dumitru Ceara wrote:
>>> On 8/10/23 18:38, Ilya Maximets wrote:
>>>> On 8/10/23 17:34, Dumitru Ceara wrote:
>>>>> On 8/10/23 17:20, Han Zhou wrote:
>>>>>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>>
>>>>>>> On 8/10/23 15:34, Han Zhou wrote:
>>>>>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 8/10/23 08:12, Ales Musil wrote:
>>>>>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ales,
>>>>>>>>>>>
>>>>>>>>>>> I have some high-level comments/questions about this patch.
>>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> Hi Ales, Mark,
>>>>>>>>>
>>>>>>>>>> thank you for the review. See my answers inline below.
>>>>>>>>>>
>>>>>>>>>>> I have been privy to the conversations that led to this change. My
>>>>>>>>>>> understanding is that by having ovn-northd wake up immediately, it
>>>>>>>>>>> can be more CPU-intensive than waiting a bit for changes to
>>>>>>>>>>> accumulate and handling all of those at once instead. However,
>>>>>>>>>>> nothing in either the commit message or ovn-nb.xml explains what
>>>>>>>>>>> the purpose of this new configuration option is. I think you
>>>>>>>>>>> should add a sentence or two to explain why someone would want to
>>>>>>>>>>> enable this option.
>>>>>>>>>>
>>>>>>>>>> Yeah, that's my bad; I have v2 prepared with some explanation in
>>>>>>>>>> the commit message together with results from the scale run.
>>>>>>>>>
>>>>>>>>> +1 we really need to explain why this change is needed.
>>>>>>>>>
>>>>>>>>>>> Next, the algorithm used here strikes me as odd.
>>>>>>>>>>> We use the previous run time of ovn-northd to determine how long
>>>>>>>>>>> to wait before running again. This delay is capped by the
>>>>>>>>>>> configured backoff time. Let's say that we've configured the
>>>>>>>>>>> backoff interval to be 200 ms. If ovn-northd has a super quick run
>>>>>>>>>>> and only takes 10 ms, then we will only delay the next run by
>>>>>>>>>>> 10 ms. IMO, this seems like it would not mitigate the original
>>>>>>>>>>> issue by much, since we are only allowing a maximum of 20 ms
>>>>>>>>>>> (10 ms for the run of ovn-northd + 10 ms delay) of NB changes to
>>>>>>>>>>> accumulate. Conversely, if northd has a huge recompute and it
>>>>>>>>>>> takes 5000 ms to complete, then we would delay the next run by
>>>>>>>>>>> 200 ms. In this case, delaying at all seems like it's not
>>>>>>>>>>> necessary since we potentially have 5000 ms worth of NB DB updates
>>>>>>>>>>> that have not been addressed. I would have expected the opposite
>>>>>>>>>>> approach to be taken. If someone configures 200 ms as their
>>>>>>>>>>> backoff interval, I would expect us to always allow a *minimum* of
>>>>>>>>>>> 200 ms of NB changes to accumulate before running again. So for
>>>>>>>>>>> instance, if northd runs quickly and is done in 10 ms, then we
>>>>>>>>>>> would wait 200 - 10 = 190 ms before processing changes again. If
>>>>>>>>>>> northd takes a long time to recompute and takes 5000 ms, then we
>>>>>>>>>>> would not wait at all before processing changes again. Was the
>>>>>>>>>>> algorithm chosen based on experimentation? Is it a well-known
>>>>>>>>>>> method I'm just not familiar with?
>>>>>>>>>
>>>>>>>>> I think the main assumption (that should probably be made explicit
>>>>>>>>> in the commit log and/or documentation) is that on average changes
>>>>>>>>> happen in a uniform way. This might not always be accurate.
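[Editor's note: to make the two delay policies being compared here concrete, below is a minimal sketch. The function names are illustrative, not from the patch; only the arithmetic comes from the thread.]

```python
def patch_delay(prev_run_ms: float, backoff_ms: float) -> float:
    """Delay used by the patch under review: sleep for as long as the
    previous engine run took, capped at the configured max backoff."""
    return min(prev_run_ms, backoff_ms)

def suggested_delay(prev_run_ms: float, backoff_ms: float) -> float:
    """Mark's suggested alternative: always let at least `backoff_ms`
    of NB changes accumulate, counting the run itself toward it."""
    return max(0.0, backoff_ms - prev_run_ms)

# The two policies diverge most at the extremes from the example:
print(patch_delay(10, 200))       # quick 10 ms run: sleep 10 ms
print(suggested_delay(10, 200))   # quick 10 ms run: sleep 190 ms
print(patch_delay(5000, 200))     # 5000 ms recompute: sleep 200 ms
print(suggested_delay(5000, 200)) # 5000 ms recompute: sleep 0 ms
```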
>>>>>>>>>
>>>>>>>>> However, if we're off with the estimate, in the worst case we'd be
>>>>>>>>> adding the configured max delay to the latency of processing
>>>>>>>>> changes. So, as long as the value is not extremely high, the impact
>>>>>>>>> is not that high either.
>>>>>>>>>
>>>>>>>>> Last but not least, as this value would be configured by the CMS,
>>>>>>>>> we assume the CMS knows what they're doing. :)
>>>>>>>>>
>>>>>>>>>> I'm not sure if the algorithm is well known.
>>>>>>>>>>
>>>>>>>>>> The thing is that at scale we almost always cap at the backoff, so
>>>>>>>>>> it probably has the same effect as your suggestion, with the
>>>>>>>>>> difference that we actually delay even after long runs. And that
>>>>>>>>>> is actually desired: it's true that, let's say, 500 ms should be
>>>>>>>>>> enough to accumulate more changes, however that can lead to
>>>>>>>>>> another 500 ms run and so on. In the end that means northd will
>>>>>>>>>> spin at 100% CPU anyway, which is what we want to avoid. So from
>>>>>>>>>> another point of view the accumulation of IDL changes is a
>>>>>>>>>> secondary effect which is still desired, but not the main purpose.
>>>>>>>>>>
>>>>>>>>>> Also, delaying by a short time if the previous run was short is
>>>>>>>>>> fine. You are right that we don't accumulate enough, however
>>>>>>>>>> during short running times there is a high chance that northd
>>>>>>>>>> would get to sleep anyway (we will help it to sleep at least a bit
>>>>>>>>>> nevertheless).
>>>>>>>>>>
>>>>>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but
>>>>>>>>>>> haven't changed the ovsdb_idl_loop_run() or
>>>>>>>>>>> ovsdb_idl_loop_commit_and_wait() calls.
>>>>>>>>>>> Is there any danger of ovn-northd getting into a busy loop of
>>>>>>>>>>> sleeping and waking because of this? I don't think it should,
>>>>>>>>>>> since presumably ovsdb_idl_loop_run() should clear the conditions
>>>>>>>>>>> waited on by ovsdb_idl_loop_commit_and_wait(), but I want to
>>>>>>>>>>> double-check.
>>>>>>>>>>
>>>>>>>>>> AFAIK it shouldn't cause any issues, as ovsdb_idl_loop_run() will
>>>>>>>>>> process anything that it can and the wait will be fine. The
>>>>>>>>>> problem would be if we skipped the ovsdb_idl_loop_run() for some
>>>>>>>>>> reason.
>>>>>>>>>>
>>>>>>>>>>> Next, does this have any negative impact on our ability to
>>>>>>>>>>> perform incremental processing in ovn-northd? My concern is that
>>>>>>>>>>> since we are still running the ovsdb IDL loop, if multiple NB
>>>>>>>>>>> changes occur during our delay, then we might have to fall back
>>>>>>>>>>> to a full recompute instead of being able to incrementally
>>>>>>>>>>> process the changes. Are my concerns valid?
>>>>>>>>>>
>>>>>>>>>> I suppose that can happen if there are changes that could result
>>>>>>>>>> in a "conflict" and a full recompute. During the test we haven't
>>>>>>>>>> seen anything like that. The odds of that happening are small
>>>>>>>>>> because, as stated previously, from the IDL point of view we are
>>>>>>>>>> doing basically the same as if the engine was running for a long
>>>>>>>>>> time, except that we give the IDL a chance to process whatever has
>>>>>>>>>> piled up within the sleep period.
>>>>>>>>>>
>>>>>>>>>>> Next, has scale testing shown that this change has made a
>>>>>>>>>>> positive impact? If so, is there any recommendation for how to
>>>>>>>>>>> determine what to configure the value to?
>>>>>>>>>>
>>>>>>>>>> It has a huge impact, actually. The value tested was 200 ms; the
>>>>>>>>>> recommendation
>>>>>>>>>
>>>>>>>>> This was chosen based on historical data from similar tests, which
>>>>>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale.
>>>>>>>>>
>>>>>>>>>> would be < 500 ms. After that point the latency on component
>>>>>>>>>> creation would be very noticeable. I will put the recommendation
>>>>>>>>>> into ovn-nb.xml together with the latency comment. Before I post
>>>>>>>>>> v2 (which has the numbers in the commit message), these are the
>>>>>>>>>> test results:
>>>>>>>>>>
>>>>>>>>>> Run without any backoff period:
>>>>>>>>>> northd aggregate CPU 9810% avg / 12765% max
>>>>>>>>>> northd was spinning at 100% CPU the entire second half of the test.
>>>>>>>>>>
>>>>>>>>>> Run with 200 ms max backoff period:
>>>>>>>>>> northd aggregate CPU 6066% avg / 7689% max
>>>>>>>>>> northd was around 60% for the second half of the test.
>>>>>>>>>>
>>>>>>>>>>> Finally, is this change expected to be a long-term necessity?
>>>>>>>>>>> This option seems to be useful for cases where northd recomputes
>>>>>>>>>>> are required. Performing recomputes less frequently seems like it
>>>>>>>>>>> would lower the CPU usage of ovn-northd while still processing
>>>>>>>>>>> the same amount of changes. However, once northd can handle most
>>>>>>>>>>> changes incrementally, is there still a benefit to delaying
>>>>>>>>>>> running? If each run of northd handles all DB changes
>>>>>>>>>>> incrementally, then is there any point in putting delays between
>>>>>>>>>>> those incremental runs?
>>>>>>>>>>
>>>>>>>>>> Ideally we won't need it in the future.
>>>>>>>>>> However, the assumption for not needing anything like this is
>>>>>>>>>> that northd will be fast enough to process I-P changes and be able
>>>>>>>>>> to sleep before the next batch update arrives from the CMS. That
>>>>>>>>>> doesn't seem likely to happen in the very near future. One thing
>>>>>>>>>> to keep in mind is that testing happened with Numan's I-P for LBs
>>>>>>>>>> and lflows, which makes a huge difference, but is still not enough
>>>>>>>>>> to achieve the mentioned northd state. So from my perspective it
>>>>>>>>>> will be relevant for a few releases. And, as stated above, the
>>>>>>>>>> point is to prevent northd from spinning at 100% CPU all the time.
>>>>>>>>>
>>>>>>>>> +1 it's not the prettiest feature (and some might rightfully call
>>>>>>>>> it a hack) but it seems to me like the cleanest alternative for
>>>>>>>>> now, until northd processing is fully incremental.
>>>>>>>>
>>>>>>>> In most cases it may be fine, but it might be a problem for a
>>>>>>>> worst-case scenario:
>>>>>>>>
>>>>>>>> Assume all the changes coming in NB can be incrementally processed,
>>>>>>>> but at a very high rate, and ovn-northd keeps processing the changes
>>>>>>>> incrementally. Since the change rate is so high, ovn-northd barely
>>>>>>>> keeps up with the changes at 99% CPU load. For example, I-P for each
>>>>>>>> object takes 10 ms, and the change rate is 99 objects/sec. According
>>>>>>>> to this algorithm, ovn-northd will always sleep for the maximum
>>>>>>>> 200 ms between each IDL run, and then ovn-northd would never keep up
>>>>>>>> with the changes any more - the backlog will become longer and
>>>>>>>> longer because of the wasted idle time.
>>>>>>>
>>>>>>> IDL runs are not skipped, just I-P engine runs. So I think this won't
>>>>>>> be a problem, or am I missing something?
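[Editor's note: Han's divergence concern can be checked with a minimal simulation, using the figures from the thread (10 ms of I-P work per changed object, changes arriving at roughly 100 objects/sec, sleep capped at 200 ms). The function and structure are illustrative, not ovn-northd code.]

```python
RATE = 0.1         # NB changes per ms (~100 objects/sec)
COST = 10          # I-P cost per changed object, in ms
MAX_BACKOFF = 200  # configured cap on the post-run sleep, in ms

def simulate(runs: int, backlog: float = 1.0) -> list:
    """Return the number of objects handled on each engine run,
    assuming the engine sleeps min(previous run time, MAX_BACKOFF)
    after every run while changes keep arriving at RATE."""
    handled = []
    for _ in range(runs):
        run_time = backlog * COST
        sleep = min(run_time, MAX_BACKOFF)
        handled.append(int(backlog))
        # Changes that accumulate while the engine is busy and then
        # asleep become the next run's backlog.
        backlog = (run_time + sleep) * RATE
    return handled

print(simulate(9))  # [1, 2, 4, 8, 16, 32, 52, 72, 92]
```

The per-run backlog doubles while the sleep tracks the run time, then keeps growing linearly once the 200 ms cap is hit: idle time is wasted on every iteration even though the engine can no longer keep up.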
>>>>>>
>>>>>> Sorry about the typo, I meant to say between each "engine run"
>>>>>> instead of "IDL run". The IDL run is not skipped, but the backlog
>>>>>> (accumulated changes in the IDL) becomes longer and longer. E.g.:
>>>>>>
>>>>>> (assume the change rate is 100 objects/sec)
>>>>>>
>>>>>> run-1: handles 1 object, takes 10ms, sleep 10ms
>>>>>> run-2: handles 2 objects, takes 20ms, sleep 20ms
>>>>>> run-3: handles 4 objects, takes 40ms, sleep 40ms
>>>>>> run-4: handles 8 objects, takes 80ms, sleep 80ms
>>>>>> run-5: handles 16 objects, takes 160ms, sleep 160ms
>>>>>> run-6: handles 32 objects, takes 320ms, sleep 200ms
>>>>>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms,
>>>>>> sleep 200ms
>>>>>> run-8: handles 72 objects, takes 720ms, sleep 200ms
>>>>>> run-9: handles 92 objects, takes 920ms, sleep 200ms
>>>>>> ...
>>>>>> As we can see, the backlog grows indefinitely if the input keeps
>>>>>> changing at the rate of 100 obj/s.
>>>>>
>>>>> I see now, thanks for the explanation. But isn't this possible today
>>>>> too? Maybe less probable, though.
>>>>>
>>>>> Also, we'll probably hit other issues (e.g., timeouts on the CMS side)
>>>>> because of the backlog, which will (likely?) throttle the CMS.
>>>>>
>>>>> What would be a good solution? It seems hard to define what a "large
>>>>> backlog" is. If we could do that, maybe triggering a full recompute
>>>>> when the backlog is large enough might be OK.
>>>>
>>>> Hmm. That's an interesting problem.
>>>> From my understanding, the cost of an incremental processing run
>>>> consists of two parts: a more or less constant engine run cost E and a
>>>> cost of processing the actual changes P. Unless the CMS is changing the
>>>> same value over and over again, we can't optimize P by accumulating
>>>> more changes per run. But we can amortize the more or less constant
>>>> cost E.
>>>>
>>>> We had a similar problem with the full recompute in the past.
>>>> If we assume the cost of a recompute R to be more or less constant
>>>> regardless of the amount of changes, then we can reduce the total cost
>>>> by aggregating more changes per run. That's what the run_idl_loop()
>>>> function is trying to do. It runs the IDL for as long as there are
>>>> changes (capped at 500 ms).
>>>>
>>>> We can't really avoid paying the cost P, and it will compound with the
>>>> increased number of changes. The only solution here, AFAICT, is to fall
>>>> back to a full recompute once P > R. Maybe that can be done by somehow
>>>> calculating the number of tracked vs. real rows in the IDL.
>>>> Current ovn-northd has this problem, and the introduction of delays
>>>> will amplify it.
>>>
>>> That is still hard to do, unfortunately. The ratio of IDL tracked vs.
>>> "real" rows is still not a good enough indication. It's possible that
>>> northd can quickly incrementally process row updates for table T1 but
>>> that it has to do intensive computations for row updates for table T2.
>>>
>>> A simpler (dumb?) approach is to just periodically recompute. But that
>>> shouldn't happen too often because then we start using a lot of CPU
>>> again. :)
>>
>> Thinking a bit more about this, if ovn-northd is processing updates
>> slower than they are arriving, the database server will eventually
>> notice a backlog on the connection and will start combining multiple
>> updates into larger ones. In this case northd will have some chances
>> to catch up.
>
> Not really. If ovn-northd processes updates slower than they are
> arriving, and if the processing is pure I-P (the processing time is
> O(N), N = number of items changed), I think it will never catch up.
> In other words, ovn-northd would have a max throughput with I-P.
> Sleeping unconditionally would decrease the throughput.
>
>> However, I think, northd should not sleep under these circumstances.
>> One way to detect this situation can be checking the time spent in the
>> IDL run.
>> We already do that, as we limit the idl_run calls to a maximum duration
>> of 500 ms. So, if we bail from the loop due to duration > 500 ms, northd
>> should not sleep and should get back to polling as soon as it has
>> finished with the previous updates. In other cases it's probably fine to
>> sleep a little, since database updates are not backlogged, i.e.
>> ovn-northd is able to process them in a more or less reasonable time.
>
> Yes, using the last loop duration as an indication to avoid sleeping may
> avoid the performance degradation in this case.
>
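[Editor's note: the policy converged on here — the capped backoff from v1/v2 combined with skipping the sleep when the IDL loop is backlogged — can be sketched as follows. The function name, constants, and structure are illustrative only, not the actual ovn-northd implementation.]

```python
IDL_LOOP_CAP_MS = 500  # the IDL loop bails after this long with changes

def next_sleep_ms(prev_run_ms: float, idl_loop_ms: float,
                  max_backoff_ms: float) -> float:
    """Combine the v1/v2 capped backoff with the skip condition
    discussed for v3: never sleep while the IDL loop is backlogged."""
    if idl_loop_ms >= IDL_LOOP_CAP_MS:
        # We bailed from the IDL loop because updates kept arriving:
        # northd is behind, so go straight back to polling.
        return 0.0
    # Otherwise apply the capped backoff from the original patch.
    return min(prev_run_ms, max_backoff_ms)

print(next_sleep_ms(150, 30, 200))   # lightly loaded: sleep 150 ms
print(next_sleep_ms(5000, 30, 200))  # long run, IDL idle: sleep 200 ms
print(next_sleep_ms(900, 500, 200))  # IDL backlogged: sleep 0 ms
```

With the skip condition in place, the divergent backlog from Han's example cannot build up: any run that bails out of the IDL loop at the 500 ms cap forgoes its sleep entirely.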
I agree, sounds good to me too.

Me personally, I vote for a v3 that:
- allows the CMS to set a max backoff time (essentially what v1 & v2 do)
- skips sleeping if we detect an IDL run loop duration > 500 ms

Regards,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev