On 12/16/23 15:23, Frode Nordahl wrote:
> On Fri, Dec 15, 2023 at 11:53 PM Ilya Maximets <[email protected]> wrote:
>>
>> On 12/13/23 23:19, Frode Nordahl wrote:
>>> Since the topic of clustered database schema conversion issues was
>>> first brought up [0], there have been several performance
>>> improvements which have greatly improved the situation.
>>>
>>> However, we cannot get away from the fact that the process remains
>>> unpredictable when faced with a cluster that has grown to a scale
>>> no longer supported by its configuration.
>>>
>>> The OVSDB Server is mostly single threaded, and for clustered
>>> databases the schema conversion is done server side in the main
>>> thread.  While conversion is ongoing the server will not service
>>> critical cluster communication, such as Raft elections.
>>>
>>> In an ideal world operators would regularly re-assess their
>>> deployment configurations with real data to ensure parameters such
>>> as the OVSDB Raft election-timer are appropriately configured for
>>> all cluster operations, including schema conversions.
>>>
>>> The reality is that this does not happen, and operators are instead
>>> discovering the hard facts of managing an OVSDB cluster at critical
>>> points in time such as during database upgrades.
>>>
>>> To alleviate this situation we introduce time keeping in the
>>> ovsdb_convert functions for clustered databases.
>>>
>>> A timeout is deduced from the configured election timer using a
>>> divisor of 4 to compute buffer time.  Buffer time is the time
>>> required for writing the result of the conversion to the raft log
>>> and initiating the command to replace data on peers before an election
>>> is held for the next term.
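>>>
>>> For illustration (variable names below are made up, not the exact
>>> code in this patch), with an election timer of N milliseconds the
>>> budget would roughly be:
>>>
>>>     /* Reserve 1/4 of the election timer as buffer time for writing
>>>      * the conversion result to the raft log and initiating the
>>>      * command to replace data on peers. */
>>>     uint64_t buffer_msec  = election_timer_msec / 4;
>>>     uint64_t timeout_msec = election_timer_msec - buffer_msec;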
>>>
>>> For conversions initiated by a client (such as ovsdb-client),
>>> the process will be aborted and an error will be returned if the
>>> schema conversion process overruns the timeout.  Conversions
>>> initiated from storage will not abort; instead, a warning indicating
>>> that the election timer is set too low will be logged.
>>>
>>> This will allow a human or automatic operator to take necessary
>>> action, such as increasing the election timer, before attempting
>>> the schema conversion again.
>>>
>>> 0: 
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052140.html
>>> Signed-off-by: Frode Nordahl <[email protected]>
>>> ---
>>
>> Hi, Frode.  Thanks for the patch!
> 
> Thank you for taking the time to review, much appreciated!
> 
>> It reminds me of the previous unfinished work from Anton:
>>   
>> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>>
>> Looks like the same general idea.  There are a few issues with this
>> kind of functionality though:
> 
> Thanks for the pointer to this attempt, I missed it.
> 
>> - The task is not finished, i.e. we have to give up and fail, which
>>   is not nice for the user.
>>
>> - Leaky abstraction.  This patch requires the ovsdb/file module to know
>>   about RAFT, which is largely irrelevant to that module.  This can be
>>   fixed by handing off the configuration from top to bottom (see Anton's
>>   implementation), but in this case it will require all the intermediate
>>   layers to pass the information along, piercing the whole hierarchy.
>>
>> - The issue appears to not be fully addressed.  In my previous testing,
>>   IIRC, sending back all the initial monitor replies after conversion
>>   was a heavier operation than the conversion itself.  So, even if the
>>   conversion doesn't time out, the jsonrpc server may cause the timeout.
>>   So, we kind of need both parts of the puzzle.
> 
> While preparing the proposal I took into account the fact that we have
> to send the result of the conversion to our peers, and found it
> plausible that the peers would not start an election while being busy
> processing the append request.
> 
> I did not take into account the effect of disconnecting all the
> *clients*, which I presume you are referring to above, and the
> thundering herd effect we get there is indeed a very good point.
> 
>> - Also, only timing the conversion is a little arbitrary.  The timer/4
>>   is chosen, but there is no guarantee that 3/4 of the timer are enough
>>   for everything else to run.  Anton's solution was trying to keep track
>>   of time for every operation and subtract from the budget, but that
>>   becomes a trade-off between how deep in the call stack we want to
>>   be able to yield (foreshadowing :) ) vs at how many levels we need to
>>   implement time budget tracking.
> 
> This is indeed the weakest point of the proposal. I found no simple
> way to determine the time required for the remaining tasks after the
> data processing.
> 
>>
>> So, I got another idea to discuss (it might have been mentioned somewhere
>> in the past years, but I do not remember).
>>
>> Can we make use of some cooperative multitasking here?
>> I'm not talking about full-fledged co-routines, but something simpler,
>> though similar in nature.
>>
>> Let's say we create a library with 2 functions:
>>
>> 1. cooperative_multitasking_register(callback, arg, time_threshold);
>> 2. cooperative_yield();
>>
>> Different modules could register a function in this module, specifying
>> how much time should pass between calls.  Other modules that are
>> expected to do heavy processing will call the yield() function at
>> points where they feel comfortable doing so.  The yield() function
>> may look through all the registered callbacks and call the ones that
>> haven't been called for the duration of their time threshold.  Maybe
>> we'll need a third function for a callback to call to update the last
>> used time.  It might be useful if the callback is called directly at
>> some point and not from the yield().
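>>
>> To make that a bit more concrete, a very rough sketch of what the
>> interface could look like (names and details are purely illustrative,
>> nothing here is set in stone):
>>
>>     /* cooperative-multitasking.h (sketch) */
>>     #include <stdint.h>
>>
>>     /* Register 'cb' to be invoked from within yield() whenever more
>>      * than 'threshold_msec' have passed since it last ran. */
>>     void cooperative_multitasking_register(void (*cb)(void *arg),
>>                                            void *arg,
>>                                            uint64_t threshold_msec);
>>
>>     /* Called by long-running code at points where it is safe to let
>>      * registered callbacks run. */
>>     void cooperative_multitasking_yield(void);
>>
>>     /* Lets a module note that its callback just ran directly, outside
>>      * of yield(), so that yield() doesn't call it again right away. */
>>     void cooperative_multitasking_update(void (*cb)(void *arg),
>>                                          uint64_t now_msec);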
>>
>> Some strict rules will be needed for such a library though.  E.g. the
>> module that registers itself with a callback must not itself use the
>> yield functionality, nor should it be possible to do so via calls to
>> other modules.  Not sure if recursion detection is enough, but it is
>> surely needed.  And basically, the module that registers the callback
>> should be self-sufficient, i.e. the internal state of that module
>> should not matter to the outside world, at least it should not matter
>> for the call stack that enters the yield().
>> yield() must not be called from places that can loop indefinitely,
>> only in places that eventually end, otherwise it may give a false
>> impression that the server is working fine while it is stuck and
>> not actually doing any useful work.
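>>
>> E.g. a trivial re-entry guard in yield() could catch at least the most
>> obvious violations of these rules.  A sketch, assuming the registration
>> data from the interface above and OVS' time_msec()/ovs_assert():
>>
>>     struct cm_entry {
>>         void (*cb)(void *arg);
>>         void *arg;
>>         uint64_t threshold_msec;  /* Max time between invocations. */
>>         uint64_t last_run_msec;   /* When the callback last ran. */
>>     };
>>     static struct cm_entry entries[8];
>>     static size_t n_entries;
>>
>>     void
>>     cooperative_multitasking_yield(void)
>>     {
>>         static bool in_yield = false;
>>         uint64_t now = time_msec();
>>         size_t i;
>>
>>         /* A registered callback, or something it calls, tried to
>>          * yield() again; that would break the rules above. */
>>         ovs_assert(!in_yield);
>>
>>         in_yield = true;
>>         for (i = 0; i < n_entries; i++) {
>>             if (now - entries[i].last_run_msec >= entries[i].threshold_msec) {
>>                 entries[i].cb(entries[i].arg);
>>                 entries[i].last_run_msec = time_msec();
>>             }
>>         }
>>         in_yield = false;
>>     }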
>>
>> The RAFT module, though, is very well isolated from the rest of the ovsdb
>> functionality, so it should be safe to call raft_run() or storage_run()
>> from almost anywhere else outside of the RAFT module itself.  Lower
>> layer libraries such as jsonrpc.c or ovsdb/log should not yield(),
>> since they can be used by the RAFT module.
>>
>> So, the RAFT module could register itself with a time threshold
>> derived from an election timer, and jsonrpc-server and conversion
>> may yield().
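>>
>> Usage could then be as simple as (illustrative only; 'raft_run_cb' and
>> the threshold of half the election timer are just examples):
>>
>>     /* In the RAFT module, whenever the election timer is (re)configured: */
>>     cooperative_multitasking_register(raft_run_cb, raft,
>>                                       raft->election_timer / 2);
>>
>>     /* In heavy processing loops, e.g. while converting rows during
>>      * ovsdb_convert() or while composing initial monitor replies: */
>>     cooperative_multitasking_yield();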
>>
>> Advantages of such a solution:
>>
>> - Less code, I hope. :)  No need for manual time tracking everywhere.
>>
>> - Easier to add yield() call to other places, if necessary.
>>
>> - A bit easier to track control flow.  Everything still runs to
>>   completion, no breaking out of the loops, no need to save states
>>   to revisit later.
> 
> Yes, I really like this part of the idea.
> 
> I have been thinking about some way of converting the ovsdb-server
> main loop to an asynchronous state machine to be able to deal with
> things like this, but as you touch on above it complicates every part
> of the implementation, making it less desirable.
> 
>> - No need to fail the conversion!
>>
>> - Unlike multi-threading, no need to worry much about data races.
> 
> Indeed async > multi-threading.
> 
>> - May be useful somewhere else?
> 
> Sounds generally useful to me.
> 
>> Disadvantages:
>>
>> - Need to be careful about which functions can yield() and which callbacks
>>   can be registered, to avoid modifications of data structures in use
>>   by the yielding function and the call chain above it.  E.g. we
>>   shouldn't modify lists that a yielding function is iterating over.
> 
> OTOH, this is no worse than the care we would have to take if we
> wanted to go the threaded route. Not sure how much of this we can
> enforce in code, but we definitely need some rigor when reviewing
> anything that makes use of this library.
> 
>> We have something similar in the python IDL code:
>>   
>> https://github.com/openvswitch/ovs/commit/d28c5ca57650d6866453d0adb9a2e048cda21a86
>> It's not intended for the python library itself, but for the
>> application that uses asyncio/eventlet to be able to run housekeeping
>> of these frameworks while IDL is processing large updates.
> 
> Thanks for the pointer.
> 
>> What do you think?  Does that make any sense?
> 
> Thank you for the counter proposal, and it does make a lot of sense to me.
> 
> The main worry I have while thinking about how this could play out is
> what would happen if raft_run(), called via yield() from within
> ovsdb_convert, processes append requests that conflict with the
> ongoing conversion. But I guess that would just lead to a retryable
> error condition reported back to the caller, which is fine; this is a
> distributed system after all. And it would most likely be acceptable
> to recommend limiting writes to the cluster while attempting a schema
> conversion.

Yeah, should be fine in general.  Commands will not be completed,
but that's fine.  We already have cases like this, e.g. if a follower
sends multiple commands to a leader before the leader is able to process
them.  The leader will process and commit one, and the rest will be
discarded as not matching prerequisites.

Normal append requests from a leader will just be accumulated until
the server gets to them.  They are already applied by a majority,
so they should apply cleanly after the conversion, once it is done.

> 
> I'll have a go at it and post a follow-up.
> 

Ack.  Thanks!

Best regards, Ilya Maximets.