On Mon, Aug 14, 2023 at 9:54 AM Ilya Maximets <i.maxim...@ovn.org> wrote: > > On 8/11/23 15:07, Dumitru Ceara wrote: > > On 8/10/23 18:38, Ilya Maximets wrote: > >> On 8/10/23 17:34, Dumitru Ceara wrote: > >>> On 8/10/23 17:20, Han Zhou wrote: > >>>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote: > >>>>> > >>>>> On 8/10/23 15:34, Han Zhou wrote: > >>>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote: > >>>>>>> > >>>>>>> On 8/10/23 08:12, Ales Musil wrote: > >>>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson < mmich...@redhat.com> > >>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi Ales, > >>>>>>>>> > >>>>>>>>> I have some high-level comments/questions about this patch. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Hi Mark, > >>>>>>>> > >>>>>>> > >>>>>>> Hi Ales, Mark, > >>>>>>> > >>>>>>>> thank you for the review. See my answers inline below. > >>>>>>>> > >>>>>>>> > >>>>>>>>> I have been privy to the conversations that led to this change. My > >>>>>>>>> understanding is that by having ovn-northd wake up immediately, it > >>>> can > >>>>>>>>> be more CPU-intensive than waiting a bit for changes to accumulate > >>>> and > >>>>>>>>> handling all of those at once instead. However, nothing in either the > >>>>>>>>> commit message or ovn-nb.xml explains what the purpose of this new > >>>>>>>>> configuration option is. I think you should add a sentence or two to > >>>>>>>>> explain why someone would want to enable this option. > >>>>>>>>> > >>>>>>>>> > >>>>>>>> Yeah that's my bad, I have v2 prepared with some explanation in the > >>>>>> commit > >>>>>>>> message > >>>>>>>> together with results from scale run. > >>>>>>>> > >>>>>>> > >>>>>>> +1 we really need to explain why this change is needed. > >>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Next, the algorithm used here strikes me as odd. We use the previous > >>>>>> run > >>>>>>>>> time of ovn-northd to determine how long to wait before running > >>>> again. > >>>>>>>>> This delay is capped by the configured backoff time. Let's say that > >>>>>>>>> we've configured the backoff interval to be 200 ms. If ovn-northd > >>>> has a > >>>>>>>>> super quick run and only takes 10 ms, then we will only delay the > >>>> next > >>>>>>>>> run by 10 ms. IMO, this seems like it would not mitigate the original > >>>>>>>>> issue by much, since we are only allowing a maximum of 20 ms (10 ms > >>>> for > >>>>>>>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate. > >>>>>>>>> Conversely, if northd has a huge recompute and it takes 5000 ms to > >>>>>>>>> complete, then we would delay the next run by 200ms. In this case, > >>>>>>>>> delaying at all seems like it's not necessary since we potentially > >>>> have > >>>>>>>>> 5000 ms worth of NB DB updates that have not been addressed. I would > >>>>>>>>> have expected the opposite approach to be taken. If someone > >>>> configures > >>>>>>>>> 200ms as their backoff interval, I would expect us to always allow a > >>>>>>>>> *minimum* of 200ms of NB changes to accumulate before running again. > >>>> So > >>>>>>>>> for instance, if northd runs quickly and is done in 10 ms, then we > >>>>>> would > >>>>>>>>> wait 200 - 10 = 190 ms before processing changes again. If northd > >>>> takes > >>>>>>>>> a long time to recompute and takes 5000 ms, then we would not wait at > >>>>>>>>> all before processing changes again. Was the algorithm chosen based > >>>> on > >>>>>>>>> experimentation? Is it a well-known method I'm just not familiar > >>>> with? > >>>>>>> > >>>>>>> I think the main assumption (that should probably be made explicit in > >>>>>>> the commit log and/or documentation) is that on average changes happen > >>>>>>> in a uniform way. This might not always be accurate. > >>>>>>> > >>>>>>> However, if we're off with the estimate, in the worst case we'd be > >>>>>>> adding the configured max delay to the latency of processing changes. > >>>>>>> So, as long as the value is not extremely high, the impact is not that > >>>>>>> high either. > >>>>>>> > >>>>>>> Last but not least, as this value would be configured by the CMS, we > >>>>>>> assume the CMS knows what they're doing. :) > >>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> I'm not sure if the algorithm is well known. > >>>>>>>> > >>>>>>>> The thing is that at scale we almost always cap at the backoff so it > >>>> has > >>>>>>>> probably > >>>>>>>> the same effect as your suggestion with the difference that we > >>>> actually > >>>>>>>> delay even > >>>>>>>> after long runs. And that is actually desired, it's true that in the > >>>>>> let's > >>>>>>>> say 500 ms > >>>>>>>> should be enough to accumulate more changes however that can lead to > >>>>>> another > >>>>>>>> 500 ms run and so on. That in the end means that northd will spin at > >>>>>> 100% > >>>>>>>> CPU > >>>>>>>> anyway which is what we want to avoid. So from another point of view > >>>> the > >>>>>>>> accumulation > >>>>>>>> of IDL changes is a secondary effect which is still desired, but not > >>>> the > >>>>>>>> main purpose. > >>>>>>>> > >>>>>>>> Also delaying by short time if the previous run was short is fine, you > >>>>>> are > >>>>>>>> right that we don't > >>>>>>>> accumulate enough however during short running times there is a high > >>>>>> chance > >>>>>>>> that the > >>>>>>>> northd would get to sleep anyway (We will help it to sleep at least a > >>>>>> bit > >>>>>>>> nevertheless). > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but > >>>>>> haven't > >>>>>>>>> changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait() > >>>>>>>>> calls. Is there any danger of ovn-northd getting in a busy loop of > >>>>>>>>> sleeping and waking because of this? I don't think it should, since > >>>>>>>>> presumably ovsdb_idl_loop_run() should clear the conditions waited on > >>>>>> by > >>>>>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check. > >>>>>>>>> > >>>>>>>> > >>>>>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will > >>>> process > >>>>>>>> anything > >>>>>>>> that it can and wait will be fine. The problem would be if we would > >>>>>> skip the > >>>>>>>> ovsdb_idl_loop_run() for some reason. > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Next, does this have any negative impact on our ability to perform > >>>>>>>>> incremental processing in ovn-northd? My concern is that since we are > >>>>>>>>> still running the ovsdb IDL loop that if multiple NB changes occur > >>>>>>>>> during our delay, then we might have to fall back to a full recompute > >>>>>>>>> instead of being able to incrementally process the changes. Are my > >>>>>>>>> concerns valid? > >>>>>>>>> > >>>>>>>> > >>>>>>>> I suppose that can happen if there are changes that could result in > >>>>>>>> "conflict" > >>>>>>>> and full recompute. During the test we haven't seen anything like > >>>> that. > >>>>>>>> The odds of that happening are small because as stated previously we > >>>> are > >>>>>>>> doing > >>>>>>>> basically the same as if the engine was running for a long time always > >>>>>> from > >>>>>>>> the IDL > >>>>>>>> point of view except that we give IDL a chance to process whatever has > >>>>>>>> pilled up > >>>>>>>> within the sleep period. > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Next, has scale testing shown that this change has made a positive > >>>>>>>>> impact? If so, is there any recommendation for how to determine what > >>>> to > >>>>>>>>> configure the value to? > >>>>>>>>> > >>>>>>>>> > >>>>>>>> It has a huge impact actually the value tested was 200 ms, the > >>>>>>>> recommendation > >>>>>>> > >>>>>>> This was chosen based on the historical data from similar tests which > >>>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale. > >>>>>>> > >>>>>>>> would be < 500 ms. After that point the latency on components creation > >>>>>>>> would be > >>>>>>>> very noticable. I will put the recommendation into the ovn-nb.xml with > >>>>>> the > >>>>>>>> latency > >>>>>>>> comment. Before I'll post v2 (which has the numbers in commit message) > >>>>>> those > >>>>>>>> are the test results: > >>>>>>>> > >>>>>>>> Run without any backoff period: > >>>>>>>> northd aggregate CPU 9810% avg / 12765% max > >>>>>>>> northd was spinning at 100% CPU the entire second half of the test. > >>>>>>>> > >>>>>>>> Run with 200 ms max backoff period: > >>>>>>>> northd aggregate CPU 6066% avg / 7689% max > >>>>>>>> northd was around 60% for the second half of the test > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Finally, is this change expected to be a long-term necessity? This > >>>>>>>>> option seems to be useful for cases where northd recomputes are > >>>>>>>>> required. Performing recomputes less frequently seems like it would > >>>>>>>>> lower the CPU usage of ovn-northd while still processing the same > >>>>>> amount > >>>>>>>>> of changes. However, once northd can handle most changes > >>>> incrementally, > >>>>>>>>> is there still a benefit to delaying running? If each run of northd > >>>>>>>>> handles all DB changes incrementally, then is there any point in > >>>>>> putting > >>>>>>>>> delays between those incremental runs? > >>>>>>>>> > >>>>>>>>> > >>>>>>>> Ideally we won't need it in the future. However, the assumption for > >>>> not > >>>>>>>> needing > >>>>>>>> anything like this is that northd will be fast enough to process I-P > >>>>>>>> changes and > >>>>>>>> be able to sleep between the next batch update arrives from CMS. That > >>>>>>>> doesn't > >>>>>>>> seem to happen in very near future, one thing to keep in mind is that > >>>>>>>> testing > >>>>>>>> happened with Numan's I-P for LBs and lflows which make a huge > >>>>>> difference, > >>>>>>>> but > >>>>>>>> still not enough to achieve the mentioned northd state. So from my > >>>>>>>> perspective > >>>>>>>> it will be relevant for a few releases. And as stated above the point > >>>>>> is to > >>>>>>>> prevent > >>>>>>>> northd to spin at 100% CPU all the time. > >>>>>>>> > >>>>>>> > >>>>>>> +1 it's not the prettiest feature (and some might rightfully call it a > >>>>>>> hack) but it seems to me like the cleanest alternative for now, until > >>>>>>> northd processing is fully incremental. > >>>>>> > >>>>>> In most cases it may be fine, but it might be a problem for a worst case > >>>>>> scenario: > >>>>>> > >>>>>> Assume all the changes coming in NB can be incrementally processed but > >>>> at > >>>>>> a very very high rate, and ovn-northd keeps processing the changes > >>>>>> incrementally. Since the change rate is so high, ovn-northd barely > >>>> keeps up > >>>>>> with the changes with 99% CPU load. For example, I-P for each object > >>>> takes > >>>>>> 10ms, and the change rate is 99 objects/sec. According to this > >>>> algorithm, > >>>>>> ovn-northd will always sleep for the maximum 200ms between each IDL run, > >>>>>> and then ovn-northd would never keep up with the changes any more - the > >>>>>> backlog will become longer and longer because of the wasted idle time. > >>>>>> > >>>>> > >>>>> IDL runs are not skipped. Just I-P engine runs. So I think this won't > >>>>> be a problem, or am I missing something? > >>>> > >>>> Sorry about the typo, I meant to say between each "engine run" instead of > >>>> "IDL run". IDL run is not skipped, but the backlog (accumulated changes in > >>>> IDL) becomes longer and longer. E.g.: > >>>> > >>>> (assume change rate is 100 object/sec) > >>>> > >>>> run-1: handles 1 object, takes 10ms, sleep 10ms > >>>> run-2: handles 2 objects, takes 20ms, sleep 20ms > >>>> run-3: handles 4 objects, takes 40ms, sleep 40ms > >>>> run-4: handles 8 objects, takes 80ms, sleep 80ms > >>>> run-5: handles 16 objects, takes 160ms, sleep 160ms > >>>> run-6: handles 32 objects, takes 320ms, sleep 200ms > >>>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms, sleep > >>>> 200ms > >>>> run-8: handles 72 objects, takes 720ms, sleep 200ms > >>>> run-9: handles 92 objects, takes 920ms, sleep 200ms > >>>> ... > >>>> As we can see the backlog grows indefinitely if the input keeps changing at > >>>> the rate of 100 obj/s. > >>>> > >>> > >>> I see now, thanks for the explanation. But isn't this possible today > >>> too? Maybe less probable though. > >>> > >>> Also, we'll probably hit other issues (e.g., timeouts on CMS side) > >>> because of the backlog which will (likely?) throttle the CMS. > >>> > >>> What would be a good solution? It seems hard to define what a "large > >>> backlog" is. If we could do that, maybe triggering a full recompute > >>> when the backlog is large enough might be OK. > >> > >> Hmm. That's an interesting problem. > >> From my understanding, the cost of incremental processing run consists > >> of two parts: some more or less constant engine run cost E and a cost > >> of processing actual changes P. Unless CMS is changing the same value > >> over and over again, we can't optimize P by accumulating more changes > >> per run. But we can amortize the more or less constant cost E. > >> > >> We had a similar problem with the full recompute in the past. If we > >> assume the cost of a recompute R to be more or less constant regardless > >> of amount of changes, then we can reduce the total cost by aggregating > >> more changes per run. That's what run_idl_loop() function is trying > >> to do. It runs IDL for as long as there are changes (capped at 500ms). > >> > >> We can't really avoid paying the cost P and it will compound with the > >> increased number of changes. The only solution here, AFAICT, is to > >> fall back to full recompute once P > R. Maybe that can be done by > >> somehow calculating the number of tracked vs real rows in IDL. > >> Current ovn-northd has this problem and introduction of delays will > >> amplify it. > >> > > > > That is still hard to do unfortunately. The ratio of IDL tracked vs > > "real" rows is still not a good enough indication. It's possible that > > northd can quickly incrementally process row updates for table T1 but > > that it has to do intensive computations for row updates for table T2. > > > > A simpler (dumb?) approach is to just periodically recompute. But that > > shouldn't happen too often because then we start using a lot of CPU > > again. :) > > > Thinking a bit more about this, if ovn-northd is processing updates > slower than they are arriving, the database server will eventually > notice a backlog on the connection and will start combining multiple > updates into larger ones. In this case northd will have some chances > to catch up.
Not really. If ovn-northd processes updates slower than they are arriving, and if the processing is pure I-P (the processing time is O(N), N = number of items changed), I think it will never catch up. In other words, ovn-northd would have a max throughput with I-P. Sleeping unconditionally would decrease the throughput. > However, I think, northd should not sleep under these > circumstances. One way to detect this situation can be checking the > time spent in idl run. We already do that as we limit the idl_run > calls with 500ms maximum duration. So, if we bail from the loop due > to duration > 500ms, northd should not sleep and get back to polling > as soon as it finished with previous updates. In other cases it's > probably fine to sleep a little, since database updates are not > backlogged, i.e. ovn-northd is able to process them in more or less > reasonable time. > Yes, using the last loop duration as an indication to avoid sleeping may avoid the performance degradation in this case. Thanks, Han > Thoughts? > > Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev