On Fri, Dec 15, 2023 at 11:53 PM Ilya Maximets <[email protected]> wrote:
>
> On 12/13/23 23:19, Frode Nordahl wrote:
> > Since the topic of clustered database schema conversion issues was
> > first brought up [0], there have been several performance
> > improvements which have greatly improved the situation.
> >
> > However, we cannot get away from the fact that the process remains
> > unpredictable, when faced with a cluster that has grown to a scale
> > no longer supported by its configuration.
> >
> > The OVSDB Server is mostly single threaded, and for clustered
> > databases the schema conversion is done server side in the main
> > thread.  While conversion is ongoing the server will not service
> > critical cluster communication, such as Raft elections.
> >
> > In an ideal world operators would regularly re-assess their
> > deployment configurations with real data to ensure parameters such
> > as the OVSDB Raft election-timer is appropriately configured for
> > all cluster operations, including schema conversions.
> >
> > The reality is that this does not happen, and operators are instead
> > discovering the hard facts of managing an OVSDB cluster at critical
> > points in time such as during database upgrades.
> >
> > To alleviate this situation we introduce time keeping in the
> > ovsdb_convert functions for clustered databases.
> >
> > A timeout is deduced from the configured election timer using a
> > divisor of 4 to compute buffer time.  Buffer time is the time
> > required for writing the result of the conversion to the raft log
> > and initiating the command to replace data on peers before an election
> > is held for the next term.
> >
> > For conversions initiated by a client (such as ovsdb-client),
> > the process will be aborted and an error will be returned if the
> > schema conversion process overruns the timeout.  Conversions
> > initiated from storage will not abort; instead, a warning indicating
> > that the election timer is set too low will be logged.
> >
> > This will allow a human or automatic operator to take necessary
> > action, such as increasing the election timer, before attempting
> > the schema conversion again.
> >
> > 0: 
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052140.html
> > Signed-off-by: Frode Nordahl <[email protected]>
> > ---
>
> Hi, Frode.  Thanks for the patch!

Thank you for taking the time to review, much appreciated!

> It reminds me of the previous unfinished work from Anton:
>   
> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>
> Looks like the same general idea.  There are a few issues with this
> kind of functionality though:

Thanks for the pointer to this attempt, I missed it.

> - The task is not finished, i.e. we have to give up and fail, which
>   is not nice for the user.
>
> - Leaky abstraction.  This patch requires ovsdb/file module to know
>   about RAFT, which is largely irrelevant to that module.  Can be fixed
>   by handing off the configuration from top to bottom (see the Anton's
>   implementation), but in this case will require all the intermediate
>   layers to pass the information along, piercing the whole hierarchy.
>
> - The issue appears to not be fully addressed.  In my previous testing,
>   IIRC, sending back all the initial monitor replies after conversion
>   was a heavier operation than a conversion itself.  So, even if
>   conversion doesn't time out, jsonrpc server may cause the timeout.
>   So, we kind of need both parts of the puzzle.

While preparing the proposal I took into account the fact that we have
to send the result of the conversion to our peers, and found it
plausible that the peers would not start an election while being busy
processing the append request.

I did not take into account the effect of disconnecting all the
*clients*, which I presume you are referring to above, and the
thundering herd effect we get there is indeed a very good point.

> - Also, only timing the conversion is a little arbitrary.  The timer/4
>   is chosen, but there is no guarantee that 3/4 of the timer is enough
>   for everything else to run.  Anton's solution was trying to keep track
>   of time for every operation and subtract it from the budget, but that
>   becomes a trade-off between how deep in the call stack we want to
>   be able to yield (foreshadowing :) ) versus how many levels need to
>   implement time budget tracking.

This is indeed the weakest point of the proposal. I found no simple
way to determine the time required for the remaining tasks after the
data processing.
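
For reference, the time keeping in the patch boils down to roughly the
following.  The helper names and the small struct are mine, purely for
illustration; the patch wires this into the ovsdb_convert path
directly.  time_msec() is the monotonic clock helper from
lib/timeval.h.

#include <stdbool.h>

long long int time_msec(void);   /* Monotonic ms clock, lib/timeval.h. */

struct convert_budget {
    long long int start;    /* When the conversion started, in ms. */
    long long int budget;   /* How long it is allowed to run, in ms. */
};

static void
convert_budget_init(struct convert_budget *cb, long long int election_timer)
{
    cb->start = time_msec();

    /* Reserve 1/4 of the election timer as buffer for writing the result
     * to the raft log and initiating the replacement of data on peers;
     * spend at most the remaining 3/4 on the conversion itself. */
    cb->budget = election_timer - election_timer / 4;
}

static bool
convert_budget_exceeded(const struct convert_budget *cb)
{
    return time_msec() - cb->start > cb->budget;
}

And that fixed 1/4 buffer is exactly what cannot account for the
remaining work you mention, such as the initial monitor replies after
the clients reconnect.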

>
> So, I got another idea to discuss (It might have been mentioned somewhere
> in the past years, but I do not remember).
>
> Can we make use of some cooperative multitasking here?
> I'm not talking about full-fledged co-routines, but something simpler,
> though similar in nature.
>
> Let's say we create a library with 2 functions:
>
> 1. cooperative_multitasking_register(callback, arg, time_threshold);
> 2. cooperative_yield();
>
> Different modules could register a function in this module, specifying
> how much time should pass between calls.  Other modules that are
> expected to have heavy processing will call the yield() function at
> points where they feel comfortable doing so.  The yield() function
> may look through all the registered callbacks and call the ones that
> haven't been called within their time threshold.  Maybe we'll need
> a third function for a callback to call to update the last used time.
> Might be useful if the callback is called directly at some point and
> not from the yield().
>
> Some strict rules will be needed for such a library though, i.e.
> the module that registers itself with a callback must not use the
> yield functionality inside, nor should it be possible to do so via
> calls to other modules.  Not sure if recursion detection is enough,
> but surely needed.  And basically, the module that registers the
> callback should be self-sufficient, i.e. the internal state of that
> module should not matter to the outside world, at least it should
> not matter for the call stack that enters the yield().
> yield() must not be called from places that can loop indefinitely,
> only in places that eventually end, otherwise it may give a false
> impression that the server is working fine while it is stuck and
> not actually doing any useful work.
>
> The RAFT module though is very well isolated from the rest of the ovsdb
> functionality, so it should be safe to call raft_run() or storage_run()
> from almost anywhere else outside of the RAFT module itself.  Lower
> layer libraries such as jsonrpc.c or ovsdb/log should not yield(),
> since they can be used by the RAFT module.
>
> So, the RAFT module could register itself with a time threshold
> derived from an election timer, and jsonrpc-server and conversion
> may yield().
>
> Advantages of such a solution:
>
> - Less code, I hope. :)  No need for manual time tracking everywhere.
>
> - Easier to add yield() call to other places, if necessary.
>
> - A bit easier to track control flow.  Everything still runs to
>   completion, no breaking out of the loops, no need to save states
>   to revisit later.

Yes, I really like this part of the idea.

I have been thinking about some way of converting the ovsdb-server
main loop into an asynchronous state machine to be able to deal with
things like this, but as you touch on above, it complicates every part
of the implementation, making it less desirable.
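
To make sure I understand the shape of it, here is a minimal sketch of
what I picture.  The register/yield names are the ones you suggested;
the third "update" function, the recursion guard and everything else
below is just my assumption of how it could look.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

long long int time_msec(void);    /* Monotonic ms clock, lib/timeval.h. */

struct coop_entry {
    void (*cb)(void *arg);        /* E.g. a thin wrapper around raft_run(). */
    void *arg;
    long long int threshold;      /* Max time between invocations, in ms. */
    long long int last_run;       /* Last time 'cb' ran, in ms. */
};

#define COOP_MAX_CALLBACKS 4
static struct coop_entry coop_entries[COOP_MAX_CALLBACKS];
static size_t coop_n_entries;
static bool coop_in_yield;        /* Crude recursion guard. */

void
cooperative_multitasking_register(void (*cb)(void *), void *arg,
                                  long long int threshold)
{
    struct coop_entry *e;

    assert(coop_n_entries < COOP_MAX_CALLBACKS);
    e = &coop_entries[coop_n_entries++];
    e->cb = cb;
    e->arg = arg;
    e->threshold = threshold;
    e->last_run = time_msec();
}

/* The third function you mention: a module notes that its callback was
 * just invoked directly, so yield() does not call it again too soon. */
void
cooperative_multitasking_update(void (*cb)(void *), void *arg)
{
    for (size_t i = 0; i < coop_n_entries; i++) {
        if (coop_entries[i].cb == cb && coop_entries[i].arg == arg) {
            coop_entries[i].last_run = time_msec();
        }
    }
}

void
cooperative_yield(void)
{
    if (coop_in_yield) {
        return;   /* A registered callback ended up back here; don't recurse. */
    }

    coop_in_yield = true;
    for (size_t i = 0; i < coop_n_entries; i++) {
        struct coop_entry *e = &coop_entries[i];

        if (time_msec() - e->last_run >= e->threshold) {
            e->cb(e->arg);
            e->last_run = time_msec();
        }
    }
    coop_in_yield = false;
}

The recursion guard alone obviously does not enforce the stricter
rules you list above; those would remain review material.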

> - No need to fail the conversion!
>
> - Unlike multi-threading, no need to worry much about data races.

Indeed async > multi-threading.

> - May be useful somewhere else?

Sounds generally useful to me.

> Disadvantages:
>
> - Need to be careful on which functions can yield() and which callbacks
>   can be registered, to avoid modifications of data structures in use
>   by the yielding function and the call chain above it.  E.g. we
>   shouldn't modify lists that a yielding function iterates.

OTOH, this is no worse than the care we would have to take if we
wanted to go the threaded route. I'm not sure how much of this we can
enforce in code, but we definitely need some rigor when reviewing
anything that includes the use of this library.

> We have something similar in the python IDL code:
>   
> https://github.com/openvswitch/ovs/commit/d28c5ca57650d6866453d0adb9a2e048cda21a86
> It's not intended for the python library itself, but for the
> application that uses asyncio/eventlet to be able to run housekeeping
> of these frameworks while IDL is processing large updates.

Thanks for the pointer.

> What do you think?  Does that make any sense?

Thank you for the counter proposal, and it does make a lot of sense to me.

The main worry I have while thinking about how this could play out is
what would happen if raft_run(), called from the ovsdb_convert yield(),
processes append requests that conflict with the ongoing conversion.
But I guess that would just lead to a retryable error condition
reported back to the caller, which is fine; this is a distributed
system after all. And it would most likely be acceptable to recommend
limiting writes to the cluster while attempting a schema conversion.
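
Roughly, I picture the wiring ending up something like the below.  To
be clear, 'struct raft' here is a stand-in with just the one field
needed for illustration, raft_run_cb(), raft_register_housekeeping()
and convert_tables() are placeholders rather than existing functions,
and the election_timer / 2 threshold is only a stand-in for "derived
from the election timer".

#include <stddef.h>

/* Declarations from the library sketch above. */
void cooperative_multitasking_register(void (*cb)(void *), void *arg,
                                       long long int threshold);
void cooperative_yield(void);

struct raft {
    long long int election_timer;    /* Election timeout, in ms. */
};

static void
raft_run_cb(void *raft_)
{
    struct raft *raft = raft_;

    /* In the real code this would call raft_run(raft): heartbeats,
     * elections, processing of append requests, and so on. */
    (void) raft;
}

static void
raft_register_housekeeping(struct raft *raft)
{
    /* Threshold comfortably below the election timer; would need to be
     * updated whenever the election timer changes. */
    cooperative_multitasking_register(raft_run_cb, raft,
                                      raft->election_timer / 2);
}

static void
convert_tables(size_t n_tables)
{
    for (size_t i = 0; i < n_tables; i++) {
        /* ... convert table 'i' ... */
        cooperative_yield();    /* Let RAFT run if it has been starved. */
    }
}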

I'll have a go at it and post a follow-up.

-- 
Frode Nordahl

> Best regards, Ilya Maximets.