On 8/14/23 19:04, Han Zhou wrote:
> On Mon, Aug 14, 2023 at 9:54 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 8/11/23 15:07, Dumitru Ceara wrote:
>>> On 8/10/23 18:38, Ilya Maximets wrote:
>>>> On 8/10/23 17:34, Dumitru Ceara wrote:
>>>>> On 8/10/23 17:20, Han Zhou wrote:
>>>>>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>>
>>>>>>> On 8/10/23 15:34, Han Zhou wrote:
>>>>>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 8/10/23 08:12, Ales Musil wrote:
>>>>>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ales,
>>>>>>>>>>>
>>>>>>>>>>> I have some high-level comments/questions about this patch.
>>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> Hi Ales, Mark,
>>>>>>>>>
>>>>>>>>>> thank you for the review. See my answers inline below.
>>>>>>>>>>
>>>>>>>>>>> I have been privy to the conversations that led to this change. My
>>>>>>>>>>> understanding is that by having ovn-northd wake up immediately, it
>>>>>>>>>>> can be more CPU-intensive than waiting a bit for changes to
>>>>>>>>>>> accumulate and handling all of those at once instead. However,
>>>>>>>>>>> nothing in either the commit message or ovn-nb.xml explains what
>>>>>>>>>>> the purpose of this new configuration option is. I think you
>>>>>>>>>>> should add a sentence or two to explain why someone would want to
>>>>>>>>>>> enable this option.
>>>>>>>>>>
>>>>>>>>>> Yeah, that's my bad; I have v2 prepared with some explanation in
>>>>>>>>>> the commit message together with results from the scale run.
>>>>>>>>>
>>>>>>>>> +1 we really need to explain why this change is needed.
>>>>>>>>>
>>>>>>>>>>> Next, the algorithm used here strikes me as odd.
>>>>>>>>>>> We use the previous run time of ovn-northd to determine how long
>>>>>>>>>>> to wait before running again. This delay is capped by the
>>>>>>>>>>> configured backoff time. Let's say that we've configured the
>>>>>>>>>>> backoff interval to be 200 ms. If ovn-northd has a super quick run
>>>>>>>>>>> and only takes 10 ms, then we will only delay the next run by
>>>>>>>>>>> 10 ms. IMO, this seems like it would not mitigate the original
>>>>>>>>>>> issue by much, since we are only allowing a maximum of 20 ms
>>>>>>>>>>> (10 ms for the run of ovn-northd + 10 ms delay) of NB changes to
>>>>>>>>>>> accumulate. Conversely, if northd has a huge recompute and it
>>>>>>>>>>> takes 5000 ms to complete, then we would delay the next run by
>>>>>>>>>>> 200 ms. In this case, delaying at all seems like it's not
>>>>>>>>>>> necessary since we potentially have 5000 ms worth of NB DB updates
>>>>>>>>>>> that have not been addressed. I would have expected the opposite
>>>>>>>>>>> approach to be taken. If someone configures 200 ms as their
>>>>>>>>>>> backoff interval, I would expect us to always allow a *minimum* of
>>>>>>>>>>> 200 ms of NB changes to accumulate before running again. So for
>>>>>>>>>>> instance, if northd runs quickly and is done in 10 ms, then we
>>>>>>>>>>> would wait 200 - 10 = 190 ms before processing changes again. If
>>>>>>>>>>> northd takes a long time to recompute and takes 5000 ms, then we
>>>>>>>>>>> would not wait at all before processing changes again. Was the
>>>>>>>>>>> algorithm chosen based on experimentation? Is it a well-known
>>>>>>>>>>> method I'm just not familiar with?
>>>>>>>>>
>>>>>>>>> I think the main assumption (that should probably be made explicit
>>>>>>>>> in the commit log and/or documentation) is that on average changes
>>>>>>>>> happen in a uniform way. This might not always be accurate.
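[Editor's note: to make the two delay policies being compared here concrete, below is a minimal sketch. The function names are illustrative, not from the patch; only the arithmetic comes from the thread.]

```python
def patch_delay(prev_run_ms: float, backoff_ms: float) -> float:
    """Delay used by the patch under review: sleep for as long as the
    previous engine run took, capped at the configured max backoff."""
    return min(prev_run_ms, backoff_ms)

def suggested_delay(prev_run_ms: float, backoff_ms: float) -> float:
    """Mark's suggested alternative: always let at least `backoff_ms`
    of NB changes accumulate, counting the run itself toward it."""
    return max(0.0, backoff_ms - prev_run_ms)

# The two policies diverge most at the extremes from the example:
print(patch_delay(10, 200))       # quick 10 ms run: sleep 10 ms
print(suggested_delay(10, 200))   # quick 10 ms run: sleep 190 ms
print(patch_delay(5000, 200))     # 5000 ms recompute: sleep 200 ms
print(suggested_delay(5000, 200)) # 5000 ms recompute: sleep 0 ms
```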
>>>>>>>>>
>>>>>>>>> However, if we're off with the estimate, in the worst case we'd be
>>>>>>>>> adding the configured max delay to the latency of processing
>>>>>>>>> changes. So, as long as the value is not extremely high, the impact
>>>>>>>>> is not that high either.
>>>>>>>>>
>>>>>>>>> Last but not least, as this value would be configured by the CMS,
>>>>>>>>> we assume the CMS knows what they're doing. :)
>>>>>>>>>
>>>>>>>>>> I'm not sure if the algorithm is well known.
>>>>>>>>>>
>>>>>>>>>> The thing is that at scale we almost always cap at the backoff, so
>>>>>>>>>> it probably has the same effect as your suggestion, with the
>>>>>>>>>> difference that we actually delay even after long runs. And that
>>>>>>>>>> is actually desired: it's true that, let's say, 500 ms should be
>>>>>>>>>> enough to accumulate more changes, however that can lead to
>>>>>>>>>> another 500 ms run and so on. In the end that means northd will
>>>>>>>>>> spin at 100% CPU anyway, which is what we want to avoid. So from
>>>>>>>>>> another point of view the accumulation of IDL changes is a
>>>>>>>>>> secondary effect which is still desired, but not the main purpose.
>>>>>>>>>>
>>>>>>>>>> Also, delaying by a short time if the previous run was short is
>>>>>>>>>> fine. You are right that we don't accumulate enough, however
>>>>>>>>>> during short running times there is a high chance that northd
>>>>>>>>>> would get to sleep anyway (we will help it to sleep at least a bit
>>>>>>>>>> nevertheless).
>>>>>>>>>>
>>>>>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but
>>>>>>>>>>> haven't changed the ovsdb_idl_loop_run() or
>>>>>>>>>>> ovsdb_idl_loop_commit_and_wait() calls.
>>>>>>>>>>> Is there any danger of ovn-northd getting into a busy loop of
>>>>>>>>>>> sleeping and waking because of this? I don't think it should,
>>>>>>>>>>> since presumably ovsdb_idl_loop_run() should clear the conditions
>>>>>>>>>>> waited on by ovsdb_idl_loop_commit_and_wait(), but I want to
>>>>>>>>>>> double-check.
>>>>>>>>>>
>>>>>>>>>> AFAIK it shouldn't cause any issues, as ovsdb_idl_loop_run() will
>>>>>>>>>> process anything that it can and the wait will be fine. The
>>>>>>>>>> problem would be if we skipped the ovsdb_idl_loop_run() for some
>>>>>>>>>> reason.
>>>>>>>>>>
>>>>>>>>>>> Next, does this have any negative impact on our ability to
>>>>>>>>>>> perform incremental processing in ovn-northd? My concern is that
>>>>>>>>>>> since we are still running the ovsdb IDL loop, if multiple NB
>>>>>>>>>>> changes occur during our delay, then we might have to fall back
>>>>>>>>>>> to a full recompute instead of being able to incrementally
>>>>>>>>>>> process the changes. Are my concerns valid?
>>>>>>>>>>
>>>>>>>>>> I suppose that can happen if there are changes that could result
>>>>>>>>>> in a "conflict" and a full recompute. During the test we haven't
>>>>>>>>>> seen anything like that. The odds of that happening are small
>>>>>>>>>> because, as stated previously, from the IDL point of view we are
>>>>>>>>>> doing basically the same as if the engine was running for a long
>>>>>>>>>> time, except that we give the IDL a chance to process whatever has
>>>>>>>>>> piled up within the sleep period.
>>>>>>>>>>
>>>>>>>>>>> Next, has scale testing shown that this change has made a
>>>>>>>>>>> positive impact? If so, is there any recommendation for how to
>>>>>>>>>>> determine what to configure the value to?
>>>>>>>>>>
>>>>>>>>>> It has a huge impact, actually. The value tested was 200 ms; the
>>>>>>>>>> recommendation
>>>>>>>>>
>>>>>>>>> This was chosen based on historical data from similar tests, which
>>>>>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale.
>>>>>>>>>
>>>>>>>>>> would be < 500 ms. After that point the latency on component
>>>>>>>>>> creation would be very noticeable. I will put the recommendation
>>>>>>>>>> into ovn-nb.xml together with the latency comment. Before I post
>>>>>>>>>> v2 (which has the numbers in the commit message), these are the
>>>>>>>>>> test results:
>>>>>>>>>>
>>>>>>>>>> Run without any backoff period:
>>>>>>>>>> northd aggregate CPU 9810% avg / 12765% max
>>>>>>>>>> northd was spinning at 100% CPU the entire second half of the test.
>>>>>>>>>>
>>>>>>>>>> Run with 200 ms max backoff period:
>>>>>>>>>> northd aggregate CPU 6066% avg / 7689% max
>>>>>>>>>> northd was around 60% for the second half of the test.
>>>>>>>>>>
>>>>>>>>>>> Finally, is this change expected to be a long-term necessity?
>>>>>>>>>>> This option seems to be useful for cases where northd recomputes
>>>>>>>>>>> are required. Performing recomputes less frequently seems like it
>>>>>>>>>>> would lower the CPU usage of ovn-northd while still processing
>>>>>>>>>>> the same amount of changes. However, once northd can handle most
>>>>>>>>>>> changes incrementally, is there still a benefit to delaying
>>>>>>>>>>> running? If each run of northd handles all DB changes
>>>>>>>>>>> incrementally, then is there any point in putting delays between
>>>>>>>>>>> those incremental runs?
>>>>>>>>>>
>>>>>>>>>> Ideally we won't need it in the future.
>>>>>>>>>> However, the assumption for not needing anything like this is
>>>>>>>>>> that northd will be fast enough to process I-P changes and be able
>>>>>>>>>> to sleep before the next batch update arrives from the CMS. That
>>>>>>>>>> doesn't seem likely to happen in the very near future. One thing
>>>>>>>>>> to keep in mind is that testing happened with Numan's I-P for LBs
>>>>>>>>>> and lflows, which makes a huge difference, but is still not enough
>>>>>>>>>> to achieve the mentioned northd state. So from my perspective it
>>>>>>>>>> will be relevant for a few releases. And, as stated above, the
>>>>>>>>>> point is to prevent northd from spinning at 100% CPU all the time.
>>>>>>>>>
>>>>>>>>> +1 it's not the prettiest feature (and some might rightfully call
>>>>>>>>> it a hack) but it seems to me like the cleanest alternative for
>>>>>>>>> now, until northd processing is fully incremental.
>>>>>>>>
>>>>>>>> In most cases it may be fine, but it might be a problem for a
>>>>>>>> worst-case scenario:
>>>>>>>>
>>>>>>>> Assume all the changes coming in NB can be incrementally processed,
>>>>>>>> but at a very high rate, and ovn-northd keeps processing the changes
>>>>>>>> incrementally. Since the change rate is so high, ovn-northd barely
>>>>>>>> keeps up with the changes at 99% CPU load. For example, I-P for each
>>>>>>>> object takes 10 ms, and the change rate is 99 objects/sec. According
>>>>>>>> to this algorithm, ovn-northd will always sleep for the maximum
>>>>>>>> 200 ms between each IDL run, and then ovn-northd would never keep up
>>>>>>>> with the changes any more - the backlog will become longer and
>>>>>>>> longer because of the wasted idle time.
>>>>>>>
>>>>>>> IDL runs are not skipped, just I-P engine runs. So I think this won't
>>>>>>> be a problem, or am I missing something?
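[Editor's note: Han's divergence concern can be checked with a minimal simulation, using the figures from the thread (10 ms of I-P work per changed object, changes arriving at roughly 100 objects/sec, sleep capped at 200 ms). The function and structure are illustrative, not ovn-northd code.]

```python
RATE = 0.1         # NB changes per ms (~100 objects/sec)
COST = 10          # I-P cost per changed object, in ms
MAX_BACKOFF = 200  # configured cap on the post-run sleep, in ms

def simulate(runs: int, backlog: float = 1.0) -> list:
    """Return the number of objects handled on each engine run,
    assuming the engine sleeps min(previous run time, MAX_BACKOFF)
    after every run while changes keep arriving at RATE."""
    handled = []
    for _ in range(runs):
        run_time = backlog * COST
        sleep = min(run_time, MAX_BACKOFF)
        handled.append(int(backlog))
        # Changes that accumulate while the engine is busy and then
        # asleep become the next run's backlog.
        backlog = (run_time + sleep) * RATE
    return handled

print(simulate(9))  # [1, 2, 4, 8, 16, 32, 52, 72, 92]
```

The per-run backlog doubles while the sleep tracks the run time, then keeps growing linearly once the 200 ms cap is hit: idle time is wasted on every iteration even though the engine can no longer keep up.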
>>>>>>
>>>>>> Sorry about the typo, I meant to say between each "engine run"
>>>>>> instead of "IDL run". The IDL run is not skipped, but the backlog
>>>>>> (accumulated changes in the IDL) becomes longer and longer. E.g.:
>>>>>>
>>>>>> (assume the change rate is 100 objects/sec)
>>>>>>
>>>>>> run-1: handles 1 object, takes 10ms, sleep 10ms
>>>>>> run-2: handles 2 objects, takes 20ms, sleep 20ms
>>>>>> run-3: handles 4 objects, takes 40ms, sleep 40ms
>>>>>> run-4: handles 8 objects, takes 80ms, sleep 80ms
>>>>>> run-5: handles 16 objects, takes 160ms, sleep 160ms
>>>>>> run-6: handles 32 objects, takes 320ms, sleep 200ms
>>>>>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms,
>>>>>> sleep 200ms
>>>>>> run-8: handles 72 objects, takes 720ms, sleep 200ms
>>>>>> run-9: handles 92 objects, takes 920ms, sleep 200ms
>>>>>> ...
>>>>>> As we can see, the backlog grows indefinitely if the input keeps
>>>>>> changing at the rate of 100 obj/s.
>>>>>
>>>>> I see now, thanks for the explanation. But isn't this possible today
>>>>> too? Maybe less probable, though.
>>>>>
>>>>> Also, we'll probably hit other issues (e.g., timeouts on the CMS side)
>>>>> because of the backlog, which will (likely?) throttle the CMS.
>>>>>
>>>>> What would be a good solution? It seems hard to define what a "large
>>>>> backlog" is. If we could do that, maybe triggering a full recompute
>>>>> when the backlog is large enough might be OK.
>>>>
>>>> Hmm. That's an interesting problem.
>>>> From my understanding, the cost of an incremental processing run
>>>> consists of two parts: a more or less constant engine run cost E and a
>>>> cost of processing the actual changes P. Unless the CMS is changing the
>>>> same value over and over again, we can't optimize P by accumulating
>>>> more changes per run. But we can amortize the more or less constant
>>>> cost E.
>>>>
>>>> We had a similar problem with the full recompute in the past.
>>>> If we assume the cost of a recompute R to be more or less constant
>>>> regardless of the amount of changes, then we can reduce the total cost
>>>> by aggregating more changes per run. That's what the run_idl_loop()
>>>> function is trying to do. It runs the IDL for as long as there are
>>>> changes (capped at 500 ms).
>>>>
>>>> We can't really avoid paying the cost P, and it will compound with the
>>>> increased number of changes. The only solution here, AFAICT, is to fall
>>>> back to a full recompute once P > R. Maybe that can be done by somehow
>>>> calculating the number of tracked vs. real rows in the IDL.
>>>> Current ovn-northd has this problem, and the introduction of delays
>>>> will amplify it.
>>>
>>> That is still hard to do, unfortunately. The ratio of IDL tracked vs.
>>> "real" rows is still not a good enough indication. It's possible that
>>> northd can quickly incrementally process row updates for table T1 but
>>> that it has to do intensive computations for row updates for table T2.
>>>
>>> A simpler (dumb?) approach is to just periodically recompute. But that
>>> shouldn't happen too often because then we start using a lot of CPU
>>> again. :)
>>
>> Thinking a bit more about this, if ovn-northd is processing updates
>> slower than they are arriving, the database server will eventually
>> notice a backlog on the connection and will start combining multiple
>> updates into larger ones. In this case northd will have some chances
>> to catch up.
>
> Not really. If ovn-northd processes updates slower than they are
> arriving, and if the processing is pure I-P (the processing time is
> O(N), N = number of items changed), I think it will never catch up.
> In other words, ovn-northd would have a max throughput with I-P.
> Sleeping unconditionally would decrease the throughput.
>
>> However, I think, northd should not sleep under these circumstances.
>> One way to detect this situation can be checking the time spent in the
>> IDL run.
>> We already do that, as we limit the idl_run calls to a maximum duration
>> of 500 ms. So, if we bail from the loop due to duration > 500 ms, northd
>> should not sleep and should get back to polling as soon as it has
>> finished with the previous updates. In other cases it's probably fine to
>> sleep a little, since database updates are not backlogged, i.e.
>> ovn-northd is able to process them in a more or less reasonable time.
>
> Yes, using the last loop duration as an indication to avoid sleeping may
> avoid the performance degradation in this case.
>
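[Editor's note: the policy converged on here — the capped backoff from v1/v2 combined with skipping the sleep when the IDL loop is backlogged — can be sketched as follows. The function name, constants, and structure are illustrative only, not the actual ovn-northd implementation.]

```python
IDL_LOOP_CAP_MS = 500  # the IDL loop bails after this long with changes

def next_sleep_ms(prev_run_ms: float, idl_loop_ms: float,
                  max_backoff_ms: float) -> float:
    """Combine the v1/v2 capped backoff with the skip condition
    discussed for v3: never sleep while the IDL loop is backlogged."""
    if idl_loop_ms >= IDL_LOOP_CAP_MS:
        # We bailed from the IDL loop because updates kept arriving:
        # northd is behind, so go straight back to polling.
        return 0.0
    # Otherwise apply the capped backoff from the original patch.
    return min(prev_run_ms, max_backoff_ms)

print(next_sleep_ms(150, 30, 200))   # lightly loaded: sleep 150 ms
print(next_sleep_ms(5000, 30, 200))  # long run, IDL idle: sleep 200 ms
print(next_sleep_ms(900, 500, 200))  # IDL backlogged: sleep 0 ms
```

With the skip condition in place, the divergent backlog from Han's example cannot build up: any run that bails out of the IDL loop at the 500 ms cap forgoes its sleep entirely.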
I agree, sounds good to me too.

Me personally, I vote for a v3 that:
- allows the CMS to set a max backoff time (essentially what v1 & v2 do)
- skips sleeping if we detect an IDL run loop duration > 500 ms

Regards,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev