On 12/16/23 15:23, Frode Nordahl wrote:
> On Fri, Dec 15, 2023 at 11:53 PM Ilya Maximets <[email protected]> wrote:
>>
>> On 12/13/23 23:19, Frode Nordahl wrote:
>>> Since the topic of clustered database schema conversion issues was
>>> first brought up [0], there have been several performance
>>> improvements which have greatly improved the situation.
>>>
>>> However, we cannot get away from the fact that the process remains
>>> unpredictable when faced with a cluster that has grown to a scale
>>> no longer supported by its configuration.
>>>
>>> The OVSDB server is mostly single threaded, and for clustered
>>> databases the schema conversion is done server side in the main
>>> thread.  While conversion is ongoing, the server will not service
>>> critical cluster communication, such as Raft elections.
>>>
>>> In an ideal world, operators would regularly re-assess their
>>> deployment configurations with real data to ensure that parameters
>>> such as the OVSDB Raft election timer are appropriately configured
>>> for all cluster operations, including schema conversions.
>>>
>>> The reality is that this does not happen, and operators instead
>>> discover the hard facts of managing an OVSDB cluster at critical
>>> points in time, such as during database upgrades.
>>>
>>> To alleviate this situation, we introduce time keeping in the
>>> ovsdb_convert functions for clustered databases.
>>>
>>> A timeout is deduced from the configured election timer, using a
>>> divisor of 4 to compute buffer time.  Buffer time is the time
>>> required for writing the result of the conversion to the Raft log
>>> and initiating the command to replace data on peers before an
>>> election is held for the next term.
>>>
>>> For conversions initiated by a client (such as ovsdb-client), the
>>> process will be aborted and an error returned if the schema
>>> conversion overruns the timeout.  Conversions initiated from
>>> storage will not abort; a warning indicating that the election
>>> timer is set too low will be logged instead.
>>>
>>> This will allow a human or automatic operator to take the necessary
>>> action, such as increasing the election timer, before attempting
>>> the schema conversion again.
>>>
>>> 0: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052140.html
>>>
>>> Signed-off-by: Frode Nordahl <[email protected]>
>>> ---
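For illustration, the timing scheme described in the commit message above
might boil down to something like the following; the helper name and the
exact computation are assumptions based on the description, not code from
the patch itself:

    #include "timeval.h"    /* time_msec(), from the OVS tree. */

    /* Hypothetical helper: deadline for a clustered schema conversion,
     * reserving 1/4 of the election timer as buffer time for writing
     * the converted database to the Raft log and initiating the
     * replacement of data on peers. */
    static long long int
    conversion_deadline(long long int election_timer_msec)
    {
        long long int buffer_msec = election_timer_msec / 4;

        return time_msec() + election_timer_msec - buffer_msec;
    }

With a 1000 ms election timer (the default), that leaves roughly 750 ms
for the conversion itself before a client-initiated conversion would be
aborted.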
>>
>> Hi, Frode.  Thanks for the patch!
>
> Thank you for taking the time to review, much appreciated!
>
>> It reminds me of the previous unfinished work from Anton:
>>   https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>
>> Looks like the same general idea.  There are a few issues with this
>> kind of functionality though:
>
> Thanks for the pointer to this attempt, I missed it.
>
>> - The task is not finished, i.e. we have to give up and fail, which
>>   is not nice for the user.
>>
>> - Leaky abstraction.  This patch requires the ovsdb/file module to
>>   know about RAFT, which is largely irrelevant to that module.  That
>>   can be fixed by handing off the configuration from top to bottom
>>   (see Anton's implementation), but then all the intermediate layers
>>   have to pass the information along, piercing the whole hierarchy.
>>
>> - The issue appears to not be fully addressed.  In my previous
>>   testing, IIRC, sending back all the initial monitor replies after
>>   conversion was a heavier operation than the conversion itself.
>>   So, even if the conversion doesn't time out, the jsonrpc server
>>   may cause the timeout.  So, we kind of need both parts of the
>>   puzzle.
>
> While preparing the proposal I took into account the fact that we
> have to send the result of the conversion to our peers, and found it
> plausible that the peers would not start an election while busy
> processing the append request.
>
> I did not take into account the effect of disconnecting all the
> *clients*, which I presume you are referring to above, and the
> thundering herd effect we get there is indeed a very good point.
>
>> - Also, only timing the conversion is a little arbitrary.  The
>>   timer/4 is chosen, but there is no guarantee that 3/4 of the timer
>>   is enough for everything else to run.  Anton's solution was trying
>>   to keep track of time for every operation and subtract it from the
>>   budget, but that becomes a trade-off between how deep in the call
>>   stack we want to be able to yield (foreshadowing :) ) vs. at how
>>   many levels we need to implement time budget tracking.
>
> This is indeed the weakest point of the proposal.  I found no simple
> way to determine the time required for the remaining tasks after the
> data processing.
>
>> So, I've got another idea to discuss (it might have been mentioned
>> somewhere in past years, but I do not remember).
>>
>> Can we make use of some cooperative multitasking here?
>> I'm not talking about full-fledged co-routines, but something
>> simpler, though similar in nature.
>>
>> Let's say we create a library with 2 functions:
>>
>>   1. cooperative_multitasking_register(callback, arg, time_threshold);
>>   2. cooperative_yield();
>>
>> Different modules could register a function in this module,
>> specifying how much time should pass between calls.  Other modules
>> that are expected to have heavy processing will call the yield()
>> function at points where they feel comfortable doing so.  The
>> yield() function may look through all the registered callbacks and
>> call the ones that weren't called for the duration of their time
>> threshold.  Maybe we'll need a third function for a callback to call
>> to update the last-run time.  That might be useful if the callback
>> is called directly at some point and not from yield().
>>
>> Some strict rules will be needed for such a library though, i.e. the
>> module that registers itself with a callback must not use the yield
>> functionality itself, nor should it be possible to do so via calls
>> to other modules.  Not sure if recursion detection is enough, but it
>> is surely needed.  And basically, the module that registers the
>> callback should be self-sufficient, i.e. the internal state of that
>> module should not matter to the outside world, at least it should
>> not matter for the call stack that enters the yield().  yield() must
>> not be called from places that can loop indefinitely, only from
>> places that eventually end, otherwise it may give a false impression
>> that the server is working fine while it is stuck and not actually
>> doing any useful work.
>>
>> The RAFT module, though, is very well isolated from the rest of the
>> ovsdb functionality, so it should be safe to call raft_run() or
>> storage_run() from almost anywhere outside of the RAFT module
>> itself.  Lower-layer libraries such as jsonrpc.c or ovsdb/log should
>> not yield(), since they can be used by the RAFT module.
>>
>> So, the RAFT module could register itself with a time threshold
>> derived from the election timer, and jsonrpc-server and conversion
>> may yield().
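A minimal sketch of what such an interface might look like, using the
function names mentioned above; the signatures, types, and threshold unit
are illustrative assumptions, not an existing OVS API:

    /* cooperative-multitasking.h (illustrative sketch only). */

    /* Register 'cb' to be invoked with 'arg' whenever more than
     * 'threshold_msec' milliseconds have passed since it last ran. */
    void cooperative_multitasking_register(void (*cb)(void *arg), void *arg,
                                           long long int threshold_msec);

    /* Called from long-running code paths (schema conversion,
     * jsonrpc-server) at points where it is safe to run housekeeping;
     * invokes every registered callback whose threshold has expired. */
    void cooperative_yield(void);

    /* Possible third entry point: note that the callback registered
     * with 'arg' just ran directly (outside of yield()), so its timer
     * is reset. */
    void cooperative_multitasking_update(void *arg);

The strict usage rules discussed above (no yield() from within a
registered module, no yield() in unbounded loops) would apply regardless
of the exact shape of the interface.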
>>
>> Advantages of such a solution:
>>
>> - Less code, I hope. :)  No need for manual time tracking
>>   everywhere.
>>
>> - Easier to add a yield() call to other places, if necessary.
>>
>> - A bit easier to track control flow.  Everything still runs to
>>   completion, no breaking out of loops, no need to save states to
>>   revisit later.
>
> Yes, I really like this part of the idea.
>
> I have been thinking about some way of converting the ovsdb-server
> main loop into an asynchronous state machine to be able to deal with
> things like this, but as you touch on above it complicates every part
> of the implementation, making it less desirable.
>
>> - No need to fail the conversion!
>>
>> - Unlike multi-threading, no need to worry much about data races.
>
> Indeed, async > multi-threading.
>
>> - May be useful somewhere else?
>
> Sounds generally useful to me.
>
>> Disadvantages:
>>
>> - Need to be careful about which functions can yield() and which
>>   callbacks can be registered, to avoid modifications of data
>>   structures in use by the yielding function and the call chain
>>   above it.  E.g., we shouldn't modify lists that a yielding
>>   function iterates over.
>
> OTOH, this is no worse than the care we would have to take if we
> wanted to go a threaded route.  Not sure how much of this we can
> enforce in code, but we definitely need some rigor in reviewing
> anything that includes the use of this library.
>
>> We have something similar in the python IDL code:
>>   https://github.com/openvswitch/ovs/commit/d28c5ca57650d6866453d0adb9a2e048cda21a86
>>
>> It's not intended for the python library itself, but for applications
>> that use asyncio/eventlet to be able to run housekeeping of these
>> frameworks while the IDL is processing large updates.
>
> Thanks for the pointer.
>
>> What do you think?  Does that make any sense?
>
> Thank you for the counter proposal, and it does make a lot of sense
> to me.
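To make that concrete, a purely illustrative sketch of how the pieces
might fit together with the hypothetical interface above; none of the
wrappers or loops below are taken from an actual patch:

    /* Illustrative only.  The RAFT module registers its housekeeping
     * with a threshold comfortably below the election timer, so that
     * heartbeats and append requests are still handled while some
     * other code path is busy. */
    static void
    raft_run_cb(void *raft_)
    {
        struct raft *raft = raft_;

        raft_run(raft);
    }

    static void
    raft_register_multitasking(struct raft *raft,
                               long long int election_timer_msec)
    {
        cooperative_multitasking_register(raft_run_cb, raft,
                                          election_timer_msec / 2);
    }

    /* Illustrative only.  The conversion code yields between rows (in
     * practice perhaps only every N rows) so that the registered RAFT
     * callback gets a chance to run in time. */
    static void
    convert_table_rows(const struct ovsdb_table *table)
    {
        const struct ovsdb_row *row;

        HMAP_FOR_EACH (row, hmap_node, &table->rows) {
            /* ... convert 'row' into the new schema ... */
            cooperative_yield();
        }
    }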
> The main worry I have while thinking about how this could play out is
> what would happen if raft_run(), called from the ovsdb_convert
> yield() function, processes append requests that would conflict with
> the on-going conversion.  But I guess that would just lead to a
> retryable error condition reported back to the caller, which is fine;
> this is a distributed system after all.  And it would most likely be
> acceptable to recommend limiting writes to the cluster while
> attempting a schema conversion.

Yeah, should be fine in general.  Commands will not be completed, but
that's fine.  We already have cases like this, e.g. if a follower sends
multiple commands to a leader before the leader is able to process
them.  The leader will process and commit one; the rest will be
discarded as not matching prerequisites.  Normal append requests from a
leader will simply be accumulated until the server gets to them.  They
are already applied by a majority, so they should apply cleanly after
conversion, once it is done.

>
> I'll have a go at it and post a follow-up.

Ack.  Thanks!

Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
