On 8/12/23 07:08, Han Zhou wrote:
> On Fri, Aug 11, 2023 at 6:07 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> On 8/10/23 18:38, Ilya Maximets wrote:
>>> On 8/10/23 17:34, Dumitru Ceara wrote:
>>>> On 8/10/23 17:20, Han Zhou wrote:
>>>>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>
>>>>>> On 8/10/23 15:34, Han Zhou wrote:
>>>>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On 8/10/23 08:12, Ales Musil wrote:
>>>>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ales,
>>>>>>>>>>
>>>>>>>>>> I have some high-level comments/questions about this patch.
>>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> Hi Ales, Mark,
>>>>>>>>
>>>>>>>>> thank you for the review. See my answers inline below.
>>>>>>>>>
>>>>>>>>>> I have been privy to the conversations that led to this change. My understanding is that by having ovn-northd wake up immediately, it can be more CPU-intensive than waiting a bit for changes to accumulate and handling all of those at once instead. However, nothing in either the commit message or ovn-nb.xml explains what the purpose of this new configuration option is. I think you should add a sentence or two to explain why someone would want to enable this option.
>>>>>>>>>
>>>>>>>>> Yeah that's my bad, I have v2 prepared with some explanation in the commit message together with results from the scale run.
>>>>>>>>
>>>>>>>> +1 we really need to explain why this change is needed.
>>>>>>>>
>>>>>>>>>> Next, the algorithm used here strikes me as odd. We use the previous run time of ovn-northd to determine how long to wait before running again. This delay is capped by the configured backoff time. Let's say that we've configured the backoff interval to be 200 ms. If ovn-northd has a super quick run and only takes 10 ms, then we will only delay the next run by 10 ms. IMO, this seems like it would not mitigate the original issue by much, since we are only allowing a maximum of 20 ms (10 ms for the run of ovn-northd + 10 ms delay) of NB changes to accumulate. Conversely, if northd has a huge recompute and it takes 5000 ms to complete, then we would delay the next run by 200ms. In this case, delaying at all seems like it's not necessary since we potentially have 5000 ms worth of NB DB updates that have not been addressed. I would have expected the opposite approach to be taken. If someone configures 200ms as their backoff interval, I would expect us to always allow a *minimum* of 200ms of NB changes to accumulate before running again. So for instance, if northd runs quickly and is done in 10 ms, then we would wait 200 - 10 = 190 ms before processing changes again. If northd takes a long time to recompute and takes 5000 ms, then we would not wait at all before processing changes again. Was the algorithm chosen based on experimentation? Is it a well-known method I'm just not familiar with?
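
To make the two strategies being contrasted here concrete, below is a minimal standalone C sketch of the delay computation: the behavior described above for the patch (delay equal to the previous run time, capped by the configured backoff) versus the "minimum accumulation window" alternative Mark describes. The helper names are made up for illustration and are not taken from the patch:

#include <stdio.h>

/* Behavior as described above for the patch: delay for as long as the
 * previous engine run took, but never longer than the configured backoff. */
static long long
delay_capped_by_backoff(long long last_run_ms, long long backoff_ms)
{
    return last_run_ms < backoff_ms ? last_run_ms : backoff_ms;
}

/* Alternative described above: always let at least 'backoff_ms' worth of
 * NB changes accumulate, counting the run time toward that window. */
static long long
delay_minimum_accumulation(long long last_run_ms, long long backoff_ms)
{
    return last_run_ms >= backoff_ms ? 0 : backoff_ms - last_run_ms;
}

int
main(void)
{
    long long backoff_ms = 200;
    long long runs[] = { 10, 200, 5000 };   /* example previous run times */

    for (int i = 0; i < 3; i++) {
        printf("run=%lldms capped=%lldms min-accumulation=%lldms\n",
               runs[i],
               delay_capped_by_backoff(runs[i], backoff_ms),
               delay_minimum_accumulation(runs[i], backoff_ms));
    }
    return 0;
}

For the 10 ms and 5000 ms runs mentioned above this prints delays of 10 ms vs. 190 ms and 200 ms vs. 0 ms respectively, which is exactly the difference being discussed.
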
>>>>>>>>
>>>>>>>> I think the main assumption (that should probably be made explicit in the commit log and/or documentation) is that on average changes happen in a uniform way. This might not always be accurate.
>>>>>>>>
>>>>>>>> However, if we're off with the estimate, in the worst case we'd be adding the configured max delay to the latency of processing changes. So, as long as the value is not extremely high, the impact is not that high either.
>>>>>>>>
>>>>>>>> Last but not least, as this value would be configured by the CMS, we assume the CMS knows what they're doing. :)
>>>>>>>>
>>>>>>>>> I'm not sure if the algorithm is well known.
>>>>>>>>>
>>>>>>>>> The thing is that at scale we almost always cap at the backoff, so it probably has the same effect as your suggestion, with the difference that we actually delay even after long runs. And that is actually desired: it's true that, let's say, 500 ms should be enough to accumulate more changes, however that can lead to another 500 ms run and so on. In the end that means northd will spin at 100% CPU anyway, which is what we want to avoid. So from another point of view the accumulation of IDL changes is a secondary effect which is still desired, but not the main purpose.
>>>>>>>>>
>>>>>>>>> Also, delaying by a short time if the previous run was short is fine; you are right that we don't accumulate enough, however during short running times there is a high chance that northd would get to sleep anyway (we will help it to sleep at least a bit nevertheless).
>>>>>>>>>
>>>>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but haven't changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait() calls. Is there any danger of ovn-northd getting into a busy loop of sleeping and waking because of this? I don't think it should, since presumably ovsdb_idl_loop_run() should clear the conditions waited on by ovsdb_idl_loop_commit_and_wait(), but I want to double-check.
>>>>>>>>>
>>>>>>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will process anything that it can and the wait will be fine. The problem would be if we skipped the ovsdb_idl_loop_run() for some reason.
>>>>>>>>>
>>>>>>>>>> Next, does this have any negative impact on our ability to perform incremental processing in ovn-northd? My concern is that, since we are still running the ovsdb IDL loop, if multiple NB changes occur during our delay, then we might have to fall back to a full recompute instead of being able to incrementally process the changes. Are my concerns valid?
>>>>>>>>>
>>>>>>>>> I suppose that can happen if there are changes that could result in a "conflict" and a full recompute. During the test we haven't seen anything like that.
>>>>>>>>> The odds of that happening are small because, as stated previously, from the IDL point of view we are doing basically the same as if the engine had been running for a long time, except that we give the IDL a chance to process whatever has piled up within the sleep period.
>>>>>>>>>
>>>>>>>>>> Next, has scale testing shown that this change has made a positive impact? If so, is there any recommendation for how to determine what to configure the value to?
>>>>>>>>>
>>>>>>>>> It has a huge impact actually; the value tested was 200 ms, the recommendation
>>>>>>>>
>>>>>>>> This was chosen based on the historical data from similar tests which showed that the I-P engine was taking ~180-200 ms to run at scale.
>>>>>>>>
>>>>>>>>> would be < 500 ms. After that point the latency of component creation would be very noticeable. I will put the recommendation into ovn-nb.xml with the latency comment. Before I post v2 (which has the numbers in the commit message), these are the test results:
>>>>>>>>>
>>>>>>>>> Run without any backoff period:
>>>>>>>>> northd aggregate CPU 9810% avg / 12765% max
>>>>>>>>> northd was spinning at 100% CPU the entire second half of the test.
>>>>>>>>>
>>>>>>>>> Run with 200 ms max backoff period:
>>>>>>>>> northd aggregate CPU 6066% avg / 7689% max
>>>>>>>>> northd was around 60% for the second half of the test
>>>>>>>>>
>
> Ales, could you clarify the version of OVN and the type of test for the above result? Is it on the main branch or with Numan's patches for LB I-P? Does the test include LB changes? Are there recomputes during the test? Was the 100% CPU during the second half of the test caused by recompute or by the cost of I-P?
>
I'll try to reply to these but Ales, please correct me if I'm wrong.

Version used for testing: OVN main + Numan's "LB + lflow I-P v5". The test includes LB changes; it brings up a 120-node OpenShift cluster using the latest OVN-IC support that was added recently to ovn-kubernetes. It runs one "zone per node". The test then creates 14K services and 28K pods distributed across 80 namespaces as fast as possible.

The test measures lots of things but we were interested in average and P99 time for pods to be "Ready" (networking ready) and CPU usage for ovn-northd on all nodes.

As far as I understand there were still occasional recomputes triggered by missing/failing handlers, e.g. sync_to_sb_lb or lflow missing a handler for SB.LB changes.

In any case, the CMS (ovn-kube) was continuously adding ports/LBs/resources to NB, as fast as possible; northd was waking up to process those. An example of logs generated by northd I-P while at scale:

2023-08-08T19:12:27.394Z|1149658|inc_proc_eng|DBG|node: lflow, handler for input northd took 1ms
2023-08-08T19:12:27.552Z|1149766|inc_proc_eng|DBG|node: lb_data, handler for input NB_load_balancer_group took 13ms
2023-08-08T19:12:27.553Z|1149779|inc_proc_eng|DBG|node: sync_to_sb_addr_set, recompute (missing handler for input northd) took 1ms
2023-08-08T19:12:27.684Z|1149782|inc_proc_eng|DBG|node: sync_to_sb_lb, recompute (missing handler for input SB_load_balancer) took 131ms
2023-08-08T19:12:27.685Z|1149794|inc_proc_eng|DBG|node: lflow, handler for input northd took 1ms
2023-08-08T19:12:27.846Z|1149893|inc_proc_eng|DBG|node: lb_data, handler for input NB_load_balancer_group took 13ms
2023-08-08T19:12:27.848Z|1149907|inc_proc_eng|DBG|node: sync_to_sb_addr_set, recompute (missing handler for input northd) took 1ms

>>
>>>>>>>>>>
>>>>>>>>>> Finally, is this change expected to be a long-term necessity? This option seems to be useful for cases where northd recomputes are required. Performing recomputes less frequently seems like it would lower the CPU usage of ovn-northd while still processing the same amount of changes. However, once northd can handle most changes incrementally, is there still a benefit to delaying running? If each run of northd handles all DB changes incrementally, then is there any point in putting delays between those incremental runs?
>>>>>>>>>
>>>>>>>>> Ideally we won't need it in the future. However, the assumption for not needing anything like this is that northd will be fast enough to process I-P changes and be able to sleep until the next batch update arrives from the CMS. That doesn't seem likely to happen in the very near future; one thing to keep in mind is that the testing happened with Numan's I-P for LBs and lflows, which makes a huge difference, but still not enough to achieve the mentioned northd state. So from my perspective it will be relevant for a few releases. And as stated above, the point is to prevent northd from spinning at 100% CPU all the time.
>>>>>>>>
>>>>>>>> +1 it's not the prettiest feature (and some might rightfully call it a hack) but it seems to me like the cleanest alternative for now, until northd processing is fully incremental.
>>>>>>>
>>>>>>> In most cases it may be fine, but it might be a problem for a worst case scenario:
>>>>>>>
>>>>>>> Assume all the changes coming into the NB can be incrementally processed but at a very, very high rate, and ovn-northd keeps processing the changes incrementally. Since the change rate is so high, ovn-northd barely keeps up with the changes at 99% CPU load. For example, I-P for each object takes 10ms, and the change rate is 99 objects/sec. According to this algorithm, ovn-northd will always sleep for the maximum 200ms between each IDL run, and then ovn-northd would never keep up with the changes any more - the backlog will become longer and longer because of the wasted idle time.
>>>>>>
>>>>>> IDL runs are not skipped. Just I-P engine runs. So I think this won't be a problem, or am I missing something?
>>>>>
>>>>> Sorry about the typo, I meant to say between each "engine run" instead of "IDL run". The IDL run is not skipped, but the backlog (accumulated changes in IDL) becomes longer and longer. E.g.:
>>>>>
>>>>> (assume the change rate is 100 objects/sec)
>>>>>
>>>>> run-1: handles 1 object, takes 10ms, sleep 10ms
>>>>> run-2: handles 2 objects, takes 20ms, sleep 20ms
>>>>> run-3: handles 4 objects, takes 40ms, sleep 40ms
>>>>> run-4: handles 8 objects, takes 80ms, sleep 80ms
>>>>> run-5: handles 16 objects, takes 160ms, sleep 160ms
>>>>> run-6: handles 32 objects, takes 320ms, sleep 200ms
>>>>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms, sleep 200ms
>>>>> run-8: handles 72 objects, takes 720ms, sleep 200ms
>>>>> run-9: handles 92 objects, takes 920ms, sleep 200ms
>>>>> ...
>>>>> As we can see the backlog grows indefinitely if the input keeps changing at the rate of 100 obj/s.
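
The run-1..run-9 arithmetic above can be reproduced with a tiny standalone simulation (purely illustrative; the 10 ms-per-object cost, the 100 obj/s rate, and the 200 ms cap are the assumptions from the example, not measured values):

#include <stdio.h>

/* Each engine run handles the objects that accumulated while the previous
 * run executed and while northd then slept; sleep = min(previous run time,
 * 200 ms), matching the backoff behavior described in the thread. */
int
main(void)
{
    const long long ms_per_object = 10;   /* assumed I-P cost per object    */
    const long long obj_interval = 10;    /* 100 obj/s -> one every 10 ms   */
    const long long max_backoff = 200;    /* configured cap from the example */

    long long backlog = 1;                /* objects waiting at run-1 */
    for (int run = 1; run <= 9; run++) {
        long long run_time = backlog * ms_per_object;
        long long sleep_ms = run_time < max_backoff ? run_time : max_backoff;

        printf("run-%d: handles %lld objects, takes %lldms, sleep %lldms\n",
               run, backlog, run_time, sleep_ms);

        /* Objects that arrive while we were busy and then sleeping. */
        backlog = (run_time + sleep_ms) / obj_interval;
    }
    return 0;
}

Running this prints exactly the sequence above: once the sleep is capped at 200 ms, each run is handed roughly 20 more objects than the previous one, so the backlog keeps growing.
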
>>>> I see now, thanks for the explanation. But isn't this possible today too? Maybe less probable though.
>>>>
>>>> Also, we'll probably hit other issues (e.g., timeouts on the CMS side) because of the backlog, which will (likely?) throttle the CMS.
>>>>
>>>> What would be a good solution? It seems hard to define what a "large backlog" is. If we could do that, maybe triggering a full recompute when the backlog is large enough might be OK.
>>>
>>> Hmm. That's an interesting problem.
>>> From my understanding, the cost of an incremental processing run consists of two parts: some more or less constant engine run cost E and a cost of processing the actual changes P. Unless the CMS is changing the same value over and over again, we can't optimize P by accumulating more changes per run. But we can amortize the more or less constant cost E.
>>>
>>> We had a similar problem with the full recompute in the past. If we assume the cost of a recompute R to be more or less constant regardless of the amount of changes, then we can reduce the total cost by aggregating more changes per run. That's what the run_idl_loop() function is trying to do. It runs the IDL for as long as there are changes (capped at 500ms).
>>>
>>> We can't really avoid paying the cost P and it will compound with the increased number of changes. The only solution here, AFAICT, is to fall back to a full recompute once P > R. Maybe that can be done by somehow calculating the number of tracked vs real rows in the IDL. Current ovn-northd has this problem and the introduction of delays will amplify it.
>>
>> That is still hard to do unfortunately. The ratio of IDL tracked vs "real" rows is still not a good enough indication. It's possible that northd can quickly incrementally process row updates for table T1 but that it has to do intensive computations for row updates for table T2.
>>
>> A simpler (dumb?) approach is to just periodically recompute. But that shouldn't happen too often because then we start using a lot of CPU again. :)
>>
>>> The problem this patch is trying to address, IIUC, is that the cost E is still high. So, I'm wondering if the solution can be similar to what we already have for a full recompute.
>
> My assumption was that the cost E is relatively small in the I-P. I thought the patch was trying to help when there were still many recomputes triggered. Please correct me if I was wrong. I posted some questions regarding Ales's test result, to get a better understanding.

I thought E was the "constant" non-incremental cost. In any case, P (the cost for I-P to successfully run) is also not negligible, e.g., even a successful I-P execution of the lb_data handler for NB.LB_Group changes took 13ms at scale:

2023-08-08T19:12:27.552Z|1149766|inc_proc_eng|DBG|node: lb_data, handler for input NB_load_balancer_group took 13ms

Regards,
Dumitru

> Thanks,
> Han
>
>>> Current run_idl_loop() runs the IDL only while there are changes and it will not continue running it if the next update is delayed even by a very short time interval. What we could potentially do here is to wait for some very short time interval (maybe configurable or maybe not) and check the IDL for changes again. This will allow us to better batch updates and amortize the constant engine run cost E without the need to sleep for an arbitrary 200ms. (We will sleep for an arbitrary 1-10ms at a time, but that seems less arbitrary. :D )
>>
>> I think we still need to set a max limit for this sequence of short sleeps and that max limit needs to be configurable by the user because it may increase end-to-end latency for propagating NB changes to ovn-controller.
>>
>> In that case I think I prefer an explicit "arbitrary" max backoff set by the user like this patch initially proposed. But I'm not completely against other options.
>>
>>> Something like (for illustration purposes; not tested):
>>>
>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
>>> index 4fa1b039e..54e4ecb5f 100644
>>> --- a/northd/ovn-northd.c
>>> +++ b/northd/ovn-northd.c
>>> @@ -689,22 +689,39 @@ run_idl_loop(struct ovsdb_idl_loop *idl_loop, const char *name)
>>>      unsigned long long duration, start = time_msec();
>>>      unsigned int seqno = UINT_MAX;
>>>      struct ovsdb_idl_txn *txn;
>>> +    int n_before_sleep = -1;
>>>      int n = 0;
>>>
>>>      /* Accumulate database changes as long as there are some,
>>>       * but no longer than half a second. */
>>> -    while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
>>> -           && time_msec() - start < 500) {
>>> -        seqno = ovsdb_idl_get_seqno(idl_loop->idl);
>>> -        ovsdb_idl_run(idl_loop->idl);
>>> -        n++;
>>> +    for (;;) {
>>> +        while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
>>> +               && time_msec() - start < 500) {
>>> +            seqno = ovsdb_idl_get_seqno(idl_loop->idl);
>>> +            ovsdb_idl_run(idl_loop->idl);
>>> +            n++;
>>> +        }
>>> +        /* If we're not out of time yet, try to sleep for a short 10ms
>>> +         * in case we'll have more updates.  Don't sleep again if there
>>> +         * were no updates after the previous short sleep. */
>>> +        if (n > n_before_sleep + 1 && time_msec() - start < 500) {
>>> +            n_before_sleep = n;
>>> +            poll_timer_wait(10);
>>> +            ovsdb_idl_wait(idl_loop->idl);
>>> +            poll_block();
>>> +            /* Reset seqno, so we try to run IDL at least one more time. */
>>> +            seqno = UINT_MAX;
>>> +        } else {
>>> +            /* Out of time or no updates since the last sleep. */
>>> +            break;
>>> +        }
>>>      }
>>>
>>>      txn = ovsdb_idl_loop_run(idl_loop);
>>>
>>>      duration = time_msec() - start;
>>> -    /* ovsdb_idl_run() is called at least 2 times.  Once directly and
>>> -     * once in the ovsdb_idl_loop_run().  n > 2 means that we received
>>> +    /* ovsdb_idl_run() is called at least 3 times.  Once directly and
>>> +     * once in the ovsdb_idl_loop_run().  n > 3 means that we received
>>>       * data on at least 2 subsequent calls. */
>>>      if (n > 2 || duration > 100) {
>>>          VLOG(duration > 500 ? VLL_INFO : VLL_DBG,
>>> ---
>>>
>>> I'm not sure how something like this will impact the total CPU usage. Hopefully, these short sleeps can allow accumulating more changes and amortize the constant costs better while also reducing perceived CPU usage. Needs testing, for sure. One other advantage of such a solution is the code locality, i.e. all the logic being in one place.
>>>
>>> Thoughts?
>>>
>>> Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev