On 8/10/23 18:38, Ilya Maximets wrote:
> On 8/10/23 17:34, Dumitru Ceara wrote:
>> On 8/10/23 17:20, Han Zhou wrote:
>>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>
>>>> On 8/10/23 15:34, Han Zhou wrote:
>>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>
>>>>>> On 8/10/23 08:12, Ales Musil wrote:
>>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> wrote:
>>>>>>>
>>>>>>>> Hi Ales,
>>>>>>>>
>>>>>>>> I have some high-level comments/questions about this patch.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>
>>>>>> Hi Ales, Mark,
>>>>>>
>>>>>>> thank you for the review. See my answers inline below.
>>>>>>>
>>>>>>>
>>>>>>>> I have been privy to the conversations that led to this change. My
>>>>>>>> understanding is that by having ovn-northd wake up immediately, it
>>>>>>>> can be more CPU-intensive than waiting a bit for changes to
>>>>>>>> accumulate and handling all of those at once instead. However,
>>>>>>>> nothing in either the commit message or ovn-nb.xml explains what
>>>>>>>> the purpose of this new configuration option is. I think you should
>>>>>>>> add a sentence or two to explain why someone would want to enable
>>>>>>>> this option.
>>>>>>>>
>>>>>>>>
>>>>>>> Yeah, that's my bad. I have v2 prepared with some explanation in
>>>>>>> the commit message, together with results from a scale run.
>>>>>>>
>>>>>>
>>>>>> +1 we really need to explain why this change is needed.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Next, the algorithm used here strikes me as odd. We use the
>>>>>>>> previous run time of ovn-northd to determine how long to wait
>>>>>>>> before running again. This delay is capped by the configured
>>>>>>>> backoff time. Let's say that we've configured the backoff interval
>>>>>>>> to be 200 ms. If ovn-northd has a super quick run and only takes
>>>>>>>> 10 ms, then we will only delay the next run by 10 ms. IMO, this
>>>>>>>> seems like it would not mitigate the original issue by much, since
>>>>>>>> we are only allowing a maximum of 20 ms (10 ms for the run of
>>>>>>>> ovn-northd + 10 ms delay) of NB changes to accumulate. Conversely,
>>>>>>>> if northd has a huge recompute and it takes 5000 ms to complete,
>>>>>>>> then we would delay the next run by 200 ms. In this case, delaying
>>>>>>>> at all seems unnecessary since we potentially have 5000 ms worth of
>>>>>>>> NB DB updates that have not been addressed. I would have expected
>>>>>>>> the opposite approach to be taken. If someone configures 200 ms as
>>>>>>>> their backoff interval, I would expect us to always allow a
>>>>>>>> *minimum* of 200 ms of NB changes to accumulate before running
>>>>>>>> again. So for instance, if northd runs quickly and is done in
>>>>>>>> 10 ms, then we would wait 200 - 10 = 190 ms before processing
>>>>>>>> changes again. If northd takes a long time to recompute, say
>>>>>>>> 5000 ms, then we would not wait at all before processing changes
>>>>>>>> again. Was the algorithm chosen based on experimentation? Is it a
>>>>>>>> well-known method I'm just not familiar with?
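>>>>>>>>
>>>>>>>> To make the contrast concrete, here is the difference in pseudo-C
>>>>>>>> (a sketch of my reading, not the actual patch code; the names are
>>>>>>>> made up):
>>>>>>>>
>>>>>>>>     /* The patch as I read it: delay the next run by the previous
>>>>>>>>      * run time, capped at the configured backoff. */
>>>>>>>>     static long long
>>>>>>>>     delay_as_proposed(long long last_run_ms, long long backoff_ms)
>>>>>>>>     {
>>>>>>>>         return last_run_ms < backoff_ms ? last_run_ms : backoff_ms;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     /* What I would have expected: guarantee a minimum accumulation
>>>>>>>>      * window of backoff_ms per cycle, so a long run already counts
>>>>>>>>      * as accumulation time. */
>>>>>>>>     static long long
>>>>>>>>     delay_expected(long long last_run_ms, long long backoff_ms)
>>>>>>>>     {
>>>>>>>>         return backoff_ms > last_run_ms
>>>>>>>>                ? backoff_ms - last_run_ms : 0;
>>>>>>>>     }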
>>>>>>
>>>>>> I think the main assumption (that should probably be made explicit in
>>>>>> the commit log and/or documentation) is that on average changes happen
>>>>>> in a uniform way.  This might not always be accurate.
>>>>>>
>>>>>> However, if we're off with the estimate, in the worst case we'd be
>>>>>> adding the configured max delay to the latency of processing changes.
>>>>>> So, as long as the value is not extremely high, the impact is not that
>>>>>> high either.
>>>>>>
>>>>>> Last but not least, as this value would be configured by the CMS, we
>>>>>> assume the CMS knows what they're doing. :)
>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> I'm not sure if the algorithm is well known.
>>>>>>>
>>>>>>> The thing is that at scale we almost always cap at the backoff, so
>>>>>>> it probably has the same effect as your suggestion, with the
>>>>>>> difference that we actually delay even after long runs. And that is
>>>>>>> actually desired: it's true that a, let's say, 500 ms window should
>>>>>>> be enough to accumulate more changes, however that can lead to
>>>>>>> another 500 ms run and so on. In the end that means northd will spin
>>>>>>> at 100% CPU anyway, which is what we want to avoid. So from another
>>>>>>> point of view the accumulation of IDL changes is a secondary effect
>>>>>>> which is still desired, but not the main purpose.
>>>>>>>
>>>>>>> Also, delaying by a short time if the previous run was short is
>>>>>>> fine. You are right that we don't accumulate much, however during
>>>>>>> short running times there is a high chance that northd would get to
>>>>>>> sleep anyway (we will help it sleep at least a bit nevertheless).
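>>>>>>>
>>>>>>> To put a rough number on the "don't spin at 100% CPU" argument
>>>>>>> (a back-of-the-envelope sketch, assuming delay = MIN(run_time, cap)
>>>>>>> as in the patch):
>>>>>>>
>>>>>>>     /* Under sustained load the loop computes for run_ms and then
>>>>>>>      * sleeps for MIN(run_ms, cap_ms), so it is busy at most
>>>>>>>      * run / (run + delay) of the time: 50% for runs below the
>>>>>>>      * cap, approaching 100% only for runs much longer than it. */
>>>>>>>     static double
>>>>>>>     worst_case_duty_cycle(double run_ms, double cap_ms)
>>>>>>>     {
>>>>>>>         double delay_ms = run_ms < cap_ms ? run_ms : cap_ms;
>>>>>>>         return run_ms / (run_ms + delay_ms);
>>>>>>>     }
>>>>>>>
>>>>>>>     /* E.g. worst_case_duty_cycle(200, 200) == 0.5, while
>>>>>>>      * worst_case_duty_cycle(500, 200) is roughly 0.71. */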
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but
>>>>>>>> haven't changed the ovsdb_idl_loop_run() or
>>>>>>>> ovsdb_idl_loop_commit_and_wait() calls. Is there any danger of
>>>>>>>> ovn-northd getting in a busy loop of sleeping and waking because of
>>>>>>>> this? I don't think it should, since presumably ovsdb_idl_loop_run()
>>>>>>>> should clear the conditions waited on by
>>>>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check.
>>>>>>>>
>>>>>>>
>>>>>>> AFAIK it shouldn't cause any issues: ovsdb_idl_loop_run() will
>>>>>>> process anything that it can, and the wait will be fine. The problem
>>>>>>> would be if we skipped ovsdb_idl_loop_run() for some reason.
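>>>>>>>
>>>>>>> The loop shape that makes this safe, as I understand it (simplified
>>>>>>> sketch, not the actual ovn-northd code; backoff_expired,
>>>>>>> remaining_backoff_ms and engine_run() are made-up names):
>>>>>>>
>>>>>>>     for (;;) {
>>>>>>>         /* Consumes all pending IDL input (and opens a txn), so the
>>>>>>>          * wait below won't immediately re-fire on data we already
>>>>>>>          * processed. */
>>>>>>>         struct ovsdb_idl_txn *txn = ovsdb_idl_loop_run(&idl_loop);
>>>>>>>
>>>>>>>         if (backoff_expired) {               /* hypothetical flag */
>>>>>>>             engine_run(txn);                 /* hypothetical helper */
>>>>>>>         } else {
>>>>>>>             poll_timer_wait(remaining_backoff_ms); /* hypothetical */
>>>>>>>         }
>>>>>>>
>>>>>>>         ovsdb_idl_loop_commit_and_wait(&idl_loop);
>>>>>>>         poll_block();
>>>>>>>     }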
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Next, does this have any negative impact on our ability to perform
>>>>>>>> incremental processing in ovn-northd? My concern is that, since we
>>>>>>>> are still running the ovsdb IDL loop, if multiple NB changes occur
>>>>>>>> during our delay, then we might have to fall back to a full
>>>>>>>> recompute instead of being able to incrementally process the
>>>>>>>> changes. Are my concerns valid?
>>>>>>>>
>>>>>>>
>>>>>>> I suppose that can happen if there are changes that could result in
>>>>>>> a "conflict" and a full recompute. During the test we haven't seen
>>>>>>> anything like that. The odds of that happening are small because, as
>>>>>>> stated previously, from the IDL point of view we are doing basically
>>>>>>> the same as if the engine was running for a long time, except that
>>>>>>> we give the IDL a chance to process whatever has piled up within the
>>>>>>> sleep period.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Next, has scale testing shown that this change has made a positive
>>>>>>>> impact? If so, is there any recommendation for how to determine
>>>>>>>> what to configure the value to?
>>>>>>>>
>>>>>>>>
>>>>>>> It has a huge impact, actually. The value tested was 200 ms; the
>>>>>>> recommendation
>>>>>>
>>>>>> This was chosen based on the historical data from similar tests which
>>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale.
>>>>>>
>>>>>>> would be < 500 ms. Beyond that point the latency of component
>>>>>>> creation would be very noticeable. I will put the recommendation
>>>>>>> into ovn-nb.xml together with the latency comment. Before I post v2
>>>>>>> (which has the numbers in the commit message), these are the test
>>>>>>> results:
>>>>>>>
>>>>>>> Run without any backoff period:
>>>>>>> northd aggregate CPU 9810% avg / 12765% max
>>>>>>> northd was spinning at 100% CPU the entire second half of the test.
>>>>>>>
>>>>>>> Run with 200 ms max backoff period:
>>>>>>> northd aggregate CPU 6066% avg / 7689% max
>>>>>>> northd was around 60% for the second half of the test.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Finally, is this change expected to be a long-term necessity? This
>>>>>>>> option seems to be useful for cases where northd recomputes are
>>>>>>>> required. Performing recomputes less frequently seems like it would
>>>>>>>> lower the CPU usage of ovn-northd while still processing the same
>>>>>>>> amount of changes. However, once northd can handle most changes
>>>>>>>> incrementally, is there still a benefit to delaying running? If
>>>>>>>> each run of northd handles all DB changes incrementally, then is
>>>>>>>> there any point in putting delays between those incremental runs?
>>>>>>>>
>>>>>>>>
>>>>>>> Ideally we won't need it in the future. However, the assumption
>>>>>>> behind not needing anything like this is that northd will be fast
>>>>>>> enough to process I-P changes and able to sleep until the next batch
>>>>>>> update arrives from the CMS. That doesn't seem likely to happen in
>>>>>>> the very near future; one thing to keep in mind is that the testing
>>>>>>> happened with Numan's I-P for LBs and lflows, which makes a huge
>>>>>>> difference, but is still not enough to achieve the mentioned northd
>>>>>>> state. So from my perspective it will be relevant for a few
>>>>>>> releases. And, as stated above, the point is to prevent northd from
>>>>>>> spinning at 100% CPU all the time.
>>>>>>>
>>>>>>
>>>>>> +1 it's not the prettiest feature (and some might rightfully call it a
>>>>>> hack) but it seems to me like the cleanest alternative for now, until
>>>>>> northd processing is fully incremental.
>>>>>
>>>>> In most cases it may be fine, but it might be a problem for a worst
>>>>> case scenario:
>>>>>
>>>>> Assume all the changes coming in NB can be incrementally processed,
>>>>> but at a very, very high rate, and ovn-northd keeps processing the
>>>>> changes incrementally. Since the change rate is so high, ovn-northd
>>>>> barely keeps up with the changes at 99% CPU load. For example, I-P
>>>>> for each object takes 10ms, and the change rate is 99 objects/sec.
>>>>> According to this algorithm, ovn-northd will always sleep for the
>>>>> maximum 200ms between each IDL run, and then ovn-northd would never
>>>>> keep up with the changes any more - the backlog will become longer
>>>>> and longer because of the wasted idle time.
>>>>>
>>>>
>>>> IDL runs are not skipped.  Just I-P engine runs.  So I think this won't
>>>> be a problem, or am I missing something?
>>>
>>> Sorry about the typo, I meant to say between each "engine run" instead of
>>> "IDL run". IDL run is not skipped, but the backlog (accumulated changes in
>>> IDL) becomes longer and longer. E.g.:
>>>
>>> (assume change rate is 100 objects/sec)
>>>
>>> run-1: handles 1 object, takes 10ms, sleep 10ms
>>> run-2: handles 2 objects, takes 20ms, sleep 20ms
>>> run-3: handles 4 objects, takes 40ms, sleep 40ms
>>> run-4: handles 8 objects, takes 80ms, sleep 80ms
>>> run-5: handles 16 objects, takes 160ms, sleep 160ms
>>> run-6: handles 32 objects, takes 320ms, sleep 200ms
>>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms, sleep
>>> 200ms
>>> run-8: handles 72 objects, takes 720ms, sleep 200ms
>>> run-9: handles 92 objects, takes 920ms, sleep 200ms
>>> ...
>>> As we can see the backlog grows indefinitely if the input keeps changing at
>>> the rate of 100 obj/s.
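>>>
>>> (For the curious, a tiny standalone C program reproducing the table
>>> above, under the same assumptions: 10 ms per object, input rate of
>>> 100 obj/s, and a 200 ms sleep cap.)
>>>
>>>     #include <stdio.h>
>>>
>>>     int
>>>     main(void)
>>>     {
>>>         double backlog = 1;            /* objects waiting at run-1 */
>>>         double rate = 100.0 / 1000.0;  /* objects arriving per ms  */
>>>
>>>         for (int run = 1; run <= 9; run++) {
>>>             double run_ms = backlog * 10;                  /* 10 ms/obj */
>>>             double sleep_ms = run_ms < 200 ? run_ms : 200; /* capped    */
>>>             printf("run-%d: handles %.0f objects, takes %.0f ms, "
>>>                    "sleeps %.0f ms\n", run, backlog, run_ms, sleep_ms);
>>>             /* Everything that arrived while running + sleeping. */
>>>             backlog = (run_ms + sleep_ms) * rate;
>>>         }
>>>         return 0;
>>>     }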
>>>
>>
>> I see now, thanks for the explanation.  But isn't this possible today
>> too?  Maybe less probable though.
>>
>> Also, we'll probably hit other issues (e.g., timeouts on CMS side)
>> because of the backlog which will (likely?) throttle the CMS.
>>
>> What would be a good solution?  It seems hard to define what a "large
>> backlog" is.  If we could do that, maybe triggering a full recompute
>> when the backlog is large enough might be OK.
> 
> Hmm.  That's an interesting problem.
> From my understanding, the cost of an incremental processing run
> consists of two parts: a more or less constant engine run cost E and
> the cost of processing the actual changes P.  Unless the CMS is
> changing the same value over and over again, we can't optimize P by
> accumulating more changes per run.  But we can amortize the more or
> less constant cost E.
> 
> We had a similar problem with the full recompute in the past.  If we
> assume the cost of a recompute R to be more or less constant
> regardless of the amount of changes, then we can reduce the total cost
> by aggregating more changes per run.  That's what the run_idl_loop()
> function is trying to do.  It runs the IDL for as long as there are
> changes (capped at 500ms).
> 
> We can't really avoid paying the cost P, and it will compound with the
> increased number of changes.  The only solution here, AFAICT, is to
> fall back to a full recompute once P > R.  Maybe that can be done by
> somehow comparing the number of tracked vs real rows in the IDL.
> Current ovn-northd has this problem and the introduction of delays
> will amplify it.
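> 
> As a sketch, the fallback rule could look like this (illustrative
> only; the per-change cost and the number of tracked changes would
> both have to be estimated somehow):
> 
>     #include <stdbool.h>
>     #include <stddef.h>
> 
>     /* Fall back to a full recompute once the estimated incremental
>      * processing cost P exceeds the recompute cost R. */
>     static bool
>     should_recompute(size_t n_tracked_changes, double per_change_cost,
>                      double recompute_cost)
>     {
>         double p = (double) n_tracked_changes * per_change_cost;
>         return p > recompute_cost;
>     }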
> 

That is still hard to do unfortunately.  The ratio of IDL tracked vs
"real" rows is still not a good enough indication.  It's possible that
northd can quickly incrementally process row updates for table T1 but
that it has to do intensive computations for row updates for table T2.

A simpler (dumb?) approach is to just periodically recompute.  But that
shouldn't happen too often because then we start using a lot of CPU
again. :)

> 
> The problem this patch is trying to address, IIUC, is that the cost E
> is still high.  So, I'm wondering if the solution can be similar to
> what we already have for a full recompute.  Current run_idl_loop()
> runs the IDL only while there are changes and it will not continue
> running it if the next update is delayed even by a very short time
> interval.  What we could potentially do here is to wait for some
> very short time interval (maybe configurable or maybe not) and check
> the IDL for changes again.  This will allow us to better batch updates
> and amortize the constant engine run cost E without the need to sleep
> for an arbitrary 200ms.  (We will sleep for an arbitrary 1-10ms at a
> time, but that seems less arbitrary. :D)

I think we still need to set a max limit for this sequence of short
sleeps and that max limit needs to be configurable by the user because
it may increase end-to-end latency for propagating NB changes to
ovn-controller.

In that case I think I prefer an explicit "arbitrary" max backoff set by
the user like this patch initially proposed.  But I'm not completely
against other options.

> 
> Something like (for illustration purposes; not tested):
> 
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 4fa1b039e..54e4ecb5f 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -689,22 +689,39 @@ run_idl_loop(struct ovsdb_idl_loop *idl_loop, const char *name)
>      unsigned long long duration, start = time_msec();
>      unsigned int seqno = UINT_MAX;
>      struct ovsdb_idl_txn *txn;
> +    int n_before_sleep = -1;
>      int n = 0;
>  
>      /* Accumulate database changes as long as there are some,
>       * but no longer than half a second. */
> -    while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
> -           && time_msec() - start < 500) {
> -        seqno = ovsdb_idl_get_seqno(idl_loop->idl);
> -        ovsdb_idl_run(idl_loop->idl);
> -        n++;
> +    for (;;) {
> +        while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
> +               && time_msec() - start < 500) {
> +            seqno = ovsdb_idl_get_seqno(idl_loop->idl);
> +            ovsdb_idl_run(idl_loop->idl);
> +            n++;
> +        }
> +        /* If we're not out of time yet, try to sleep for a short 10ms
> +         * in case we'll have more updates.  Don't sleep again if there
> +         * were no updates after the previous short sleep. */
> +        if (n > n_before_sleep + 1 && time_msec() - start < 500) {
> +            n_before_sleep = n;
> +            poll_timer_wait(10);
> +            ovsdb_idl_wait(idl_loop->idl);
> +            poll_block();
> +            /* Reset seqno, so we try to run IDL at least one more time. */
> +            seqno = UINT_MAX;
> +        } else {
> +            /* Out of time or no updates since the last sleep. */
> +            break;
> +        }
>      }
>  
>      txn = ovsdb_idl_loop_run(idl_loop);
>  
>      duration = time_msec() - start;
> -    /* ovsdb_idl_run() is called at least 2 times.  Once directly and
> -     * once in the ovsdb_idl_loop_run().  n > 2 means that we received
> +    /* ovsdb_idl_run() is called at least 3 times.  Twice directly and
> +     * once in the ovsdb_idl_loop_run().  n > 3 means that we received
>       * data on at least 2 subsequent calls. */
> -    if (n > 2 || duration > 100) {
> +    if (n > 3 || duration > 100) {
>          VLOG(duration > 500 ? VLL_INFO : VLL_DBG,
> ---
> 
> I'm not sure how something like this will impact the total CPU usage.
> Hopefully, these short sleeps can allow accumulating more changes and
> amortize the constant costs better while also reducing perceived CPU
> usage.  Needs testing, for sure.  One other advantage of such a solution
> is the code locality, i.e. all the logic being in one place.
> 
> Thoughts?
> 
> Best regards, Ilya Maximets.
> 
