On Mon, Mar 27, 2023 at 9:50 PM Ilya Maximets <[email protected]> wrote:
>
> On 1/23/23 11:44, Frode Nordahl wrote:
> > On Tue, Jan 3, 2023 at 3:07 PM Ilya Maximets <[email protected]> wrote:
> >>
> >> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> >>> Hello,
> >>>
> >>> When performing an online schema conversion for a clustered DB, `ovsdb-client` connects to the current leader of the cluster and requests it to convert the DB to a new schema.
> >>>
> >>> The main thread of the leader ovsdb-server will then parse the new schema and copy the entire database into a new in-memory copy using the new schema. For a moderately sized database, let's say 650MB on-disk, this process can take north of 24 seconds on a modern, adequately performant system.
> >>>
> >>> While this is happening, the ovsdb-server process will not process any raft election events or inactivity probes, so by the time the conversion is done and the now-past leader wants to write the converted database to the cluster, its connection to the cluster is dead.
> >>>
> >>> The past leader will keep repeating this process indefinitely, until the client requesting the conversion disconnects. No message is passed to the client.
> >>>
> >>> Meanwhile, the other nodes in the cluster have moved on with a new leader.
> >>>
> >>> A workaround for this scenario would be to increase the election timer to a value large enough that the conversion can succeed within an election window.
> >>>
> >>> I don't view this as a permanent solution though, as it would be unfair to leave the end user guessing the correct election timer in order for their upgrades to succeed.
> >>>
> >>> Maybe we need to hand off conversion to a thread and make the main loop only process raft requests until it is done, similar to the recent addition of preparing snapshot JSON in a separate thread [0].
> >>>
> >>> Any other thoughts or ideas?
> >>>
> >>> 0: https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >>
> >> Hi, Frode. Thanks for starting this conversation.
> >
> > Thanks a lot for your comprehensive response!
> >
> >> First of all, I'd still respectfully disagree that 650 MB is a moderately sized database. :) ovsdb-server on its own doesn't limit users on how much data they can put in, but that doesn't mean there is no limit at which it will be difficult or even impossible for it to handle the database. From my experience, 650 MB is far beyond the threshold for smooth operation.
> >
> > I guess my comment about the moderate size was really targeted at the size of the deployment the DB came from. After looking more deeply into it, it is also clear that after a compaction the raw size of the file is around 305MB.
> >
> >> Allowing a database to grow to such a size might be considered a user error, or a CMS error. In any case, setups should be tested at the desired (at least simulated) scale, including upgrades, before deploying in a production environment, to not run into such issues unexpectedly.
> >
> > I do not disagree with this; part of the problem is that the ovn-ctl script currently defaults to attempting a schema upgrade on startup, which means that this procedure is executed whenever a user upgrades packages.
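For anyone running into this before a proper fix lands, the two stop-gaps mentioned above can be applied roughly as follows. The socket and file paths here are illustrative and depend on packaging, and, as far as I recall, cluster/change-election-timer only accepts up to a doubling of the current value per call, so a large increase has to be done in steps. Timing an offline conversion of a standalone copy only approximates the in-memory conversion cost, but it gives a feel for whether the chosen election timer is large enough:

    # Check cluster state and the current election timer (run against the
    # leader; the control socket path depends on how OVN is installed).
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

    # Raise the election timer (milliseconds).  Repeat the call if the
    # target is more than double the current value.
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 30000

    # Rough offline estimate of the conversion cost: make a standalone copy
    # of the clustered DB and time a conversion of it against the new schema.
    ovsdb-tool cluster-to-standalone /tmp/sb-standalone.db /var/lib/ovn/ovnsb_db.db
    time ovsdb-tool convert /tmp/sb-standalone.db /usr/share/ovn/ovn-sb.ovsschema /tmp/sb-converted.db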
> >
> > The lack of visibility of the failed upgrade, and the severe side effects a failed upgrade leads to, make me think we may need to change this.
> >
> > The reality is, sadly, that few users do full-scale tests/simulations of their deployment to check whether the next upgrade would succeed.
> >
> >> Another way out of the situation, besides bumping the election timer, might be to pin ovn-controllers, destroy the database (maybe keeping port bindings, etc.) and let northd re-create it after conversion. Not sure if that will actually work though, as I didn't try.
> >
> > I believe that for the incident which raised this topic to our attention, the user was unblocked by emptying the southbound DB and letting northd re-create it.
> >
> >> For the threads, I'll re-iterate my thought that throwing more cores at the problem is absolutely the last thing we should do. Only if there is no other choice. Simply because many parts of ovsdb-server were never optimized for performance, and there are likely many things we can do to improve without blindly using more resources and increasing the code complexity by adding threads.
> >
> > I admire your persistence in upholding this standard, and you are right, we should absolutely try every avenue before adding bloat and complexity to OVS/OVN. Thank you for keeping the bar high! :)
> >
> > The main reason for thinking about a separate thread here was to avoid the DB cluster leader disappearing from the cluster until the client attempting the conversion disconnects. For the operator, the underlying reason for this happening is non-obvious. I guess we could handle that part with better messaging.
> >
> >> Speaking of suboptimal code, the conversion process seems very inefficient. Let's deconstruct it. (I'll skip the standalone case, focusing on the clustered mode.)
> >>
> >> There are a few main steps:
> >>
> >> 1. ovsdb_convert() - Creates a copy of the database, converting each column along the way, and checks the constraints.
> >>
> >> 2. ovsdb_to_txn_json() - Converts the new database into a transaction JSON object.
> >>
> >> 3. ovsdb_txn_propose_schema_change() - Writes the new schema and the transaction JSON to the storage (RAFT).
> >>
> >> 4. ovsdb_destroy() - The copy of the database is destroyed.
> >>
> >> -----
> >>
> >> 5. read_db()/parse_txn() - Reads the new schema and the transaction JSON from the storage, replaces the current database with an empty one and replays the transaction that creates a new converted database.
> >>
> >> There is always a storage run between steps 4 and 5, so we generally only care that steps 1-4 and step 5 are each, separately, below the election timer threshold.
> >>
> >> Now, looking closer at step 1, which is the most time-consuming step: it has two stages - data conversion and the transaction check. The data conversion part makes sure that we're creating all the rows in the new database with all the new columns and without removed columns. It also makes sure that all the datum objects are converted from the old column type to the new column type by calling ovsdb_datum_convert() for every one of them.
> >>
> >> Datum conversion is a very heavy operation, because it involves converting the datum to JSON and back. However, in the vast majority of cases column types do not change at all, and even if they do, it only happens for a few columns, not for all of them.
> >>
> >> This part can be optimized by not converting columns if the type didn't change, and just creating a shallow copy of the datum with ovsdb_datum_clone() instead.
> >>
> >> In my preliminary testing this saves about 70% of the time spent in ovsdb_convert() and reduces the overall conversion time by about 30%. Creation of a shallow copy also speeds up the database destruction at step 4 and saves some memory.
> >>
> >> I'll post the patches I have for this a bit later, after cleaning them up.
> >
> > This would indeed be a good improvement; the patches [0] look great.
> >
> > I took them for a spin on the problem database, and I unfortunately did not see the same level of improvement for conversion time there. Looking more deeply into why, I see there are more than 700k records in the Logical_Flow table; more than 600k of those rows are for the `lr_in_arp_resolve` stage, with repeating rows for around 1500 unique MAC addresses. The Logical_Flow table has 4 columns that have changed type since 20.03, but I guess that fact is probably less important than the sheer number of records present.
> >
> > As previously stated, the size of this deployment is not huge. I'll try to gain access to the NB DB too, and see if this is some sort of logical flow explosion bug that has already been fixed. If that is the case, it would also feed into a need for changing how schema upgrades are handled.
> >
> >> The next part of ovsdb_convert() is ovsdb_txn_replay_commit(), which we can't really get rid of, because we have to perform all the transaction checks, including the referential integrity checks that take up most of the time. We might look at further optimizations for the weak reference re-assessment process in the future though, as it seems to be the heaviest part.
> >>
> >> Now to steps 2 and 3. Creation of a transaction JSON and writing it to the storage seems to be completely redundant. I understand that it was probably done this way because conversion to JSON and back was cheaper than ovsdb_convert(). However, with the change for step 1, that will no longer be the case.
> >>
> >> Database conversion is a deterministic process. All we need to perform it is the database content, which is already in the storage, and the new schema. So, we should be able to drop step 2 and write only the new schema to the storage at step 3. At step 5 we'll need to read the schema and call ovsdb_convert() again, but with the above optimizations that should not be heavier than parsing the transaction from JSON and replaying it. RAFT will ensure that we're converting the same data as in the first ovsdb_convert().
> >>
> >> Writing only the new schema to the storage may mean a backward-incompatible database file format change, but that should not be very hard to handle, taking into account that it only happens during conversion and the next compaction will get rid of the incompatibility.
> >>
> >> I'll sketch up some patches for this as well. I didn't try it yet, so I don't have any performance data.
> >>
> >> All in all, the conversion process will be condensed to just two calls to the optimized ovsdb_convert() and two database_destroy() calls: one before writing to storage and one after reading from it. This should be enough to cover most of the use cases, I hope.
> >>
> >> Thoughts?
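To make the shallow-copy optimization concrete for readers following along, here is a small self-contained sketch of the pattern. The structs and helper names below are toy stand-ins invented for illustration, not the real ovsdb_datum/ovsdb_type API; the only point is the guard: take a cheap clone when the column type is unchanged, and fall back to the expensive convert path only when the type actually changed.

    /* Toy illustration of "shallow copy instead of full conversion".
     * Simplified stand-ins, NOT the real ovsdb_datum/ovsdb_type API. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct toy_type  { const char *kind; };    /* e.g. "string", "integer" */
    struct toy_datum {
        int refcnt;                            /* shared, refcounted payload */
        char *value;                           /* stands in for the real datum */
    };

    static struct toy_datum *
    toy_datum_create(const char *value)
    {
        struct toy_datum *d = malloc(sizeof *d);
        d->refcnt = 1;
        d->value = strdup(value);
        return d;
    }

    /* Heavy path: stands in for ovsdb_datum_convert(), which serializes the
     * datum to JSON and parses it back under the new column type. */
    static struct toy_datum *
    toy_datum_convert(const struct toy_datum *old, const struct toy_type *new_type)
    {
        (void) new_type;                       /* type-specific parsing omitted */
        return toy_datum_create(old->value);   /* deep copy ~ JSON round trip */
    }

    /* Cheap path: stands in for ovsdb_datum_clone() as a shallow copy. */
    static struct toy_datum *
    toy_datum_clone(struct toy_datum *old)
    {
        old->refcnt++;
        return old;
    }

    static struct toy_datum *
    convert_column(struct toy_datum *old,
                   const struct toy_type *old_type,
                   const struct toy_type *new_type)
    {
        /* The optimization: only pay for a real conversion when the column
         * type actually changed between the old and the new schema. */
        if (!strcmp(old_type->kind, new_type->kind)) {
            return toy_datum_clone(old);
        }
        return toy_datum_convert(old, new_type);
    }

    int
    main(void)
    {
        struct toy_type str = { "string" }, integer = { "integer" };
        struct toy_datum *d = toy_datum_create("42");

        struct toy_datum *same = convert_column(d, &str, &str);
        struct toy_datum *diff = convert_column(d, &str, &integer);

        printf("unchanged type -> shallow copy shared: %s (refcnt=%d)\n",
               same == d ? "yes" : "no", d->refcnt);
        printf("changed type   -> real conversion:     %s\n",
               diff != d ? "yes" : "no");

        free(diff->value);
        free(diff);
        free(d->value);
        free(d);
        return 0;
    }

In the real code the win comes from avoiding the JSON serialize/parse round trip for every value in the database; by comparison, the shallow copy is essentially just bookkeeping, which lines up with the savings reported above.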
> >
> > Thanks again for the comprehensive explanation and suggestions for performance improvements. The performance increase you've seen from skipping unnecessary JSON conversions [0] is impressive.
> >
> > However, I'm not sure we can optimize away the entirety of the problem. For the end user, it is currently quite hard to identify what the problem is, and how to deal with it, when the schema upgrade fails. I'll have a look at this part of the equation too and propose some patches.
> >
> > 0: https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>
> Just posted a patch set that should address the remaining parts of the proposed solution:
>
> https://patchwork.ozlabs.org/project/openvswitch/cover/[email protected]/
>
> I also uncovered a few other issues along the way and tried to fix most of them as well.
That's great, thank you for working on this. I'll take a look and throw some real-life DBs from past problematic upgrades at it.

--
Frode Nordahl

> Best regards, Ilya Maximets.
