On Mon, Mar 27, 2023 at 9:50 PM Ilya Maximets <[email protected]> wrote:
>
> On 1/23/23 11:44, Frode Nordahl wrote:
> > On Tue, Jan 3, 2023 at 3:07 PM Ilya Maximets <[email protected]> wrote:
> >>
> >> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> >>> Hello,
> >>>
> >>> When performing an online schema conversion for a clustered DB, `ovsdb-client` connects to the current leader of the cluster and requests it to convert the DB to a new schema.
> >>>
> >>> The main thread of the leader ovsdb-server will then parse the new schema and copy the entire database into a new in-memory copy using the new schema. For a moderately sized database, let's say 650MB on-disk, this process can take north of 24 seconds on a modern, adequately performant system.
> >>>
> >>> While this is happening, the ovsdb-server process will not process any raft election events or inactivity probes, so by the time the conversion is done and the now-past leader wants to write the converted database to the cluster, its connection to the cluster is dead.
> >>>
> >>> The past leader will keep repeating this process indefinitely, until the client requesting the conversion disconnects. No message is passed to the client.
> >>>
> >>> Meanwhile, the other nodes in the cluster have moved on with a new leader.
> >>>
> >>> A workaround for this scenario would be to increase the election timer to a value large enough that the conversion can succeed within an election window.
> >>>
> >>> I don't view this as a permanent solution though, as it would be unfair to leave the end user guessing the correct election timer in order for their upgrades to succeed.
> >>>
> >>> Maybe we need to hand off conversion to a thread and make the main loop only process raft requests until it is done, similar to the recent addition of preparing snapshot JSON in a separate thread [0].
> >>>
> >>> Any other thoughts or ideas?
> >>>
> >>> 0: https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >>
> >> Hi, Frode. Thanks for starting this conversation.
> >
> > Thanks a lot for your comprehensive response!
> >
> >> First of all, I'd still respectfully disagree that 650 MB is a moderately sized database. :) ovsdb-server on its own doesn't limit users on how much data they can put in, but that doesn't mean there is no limit at which it will be difficult or even impossible for it to handle the database. From my experience, 650 MB is far beyond the threshold for smooth operation.
> >
> > I guess my comment about the moderate size was really targeted at the size of the deployment the DB came from. After looking more deeply into it, it is also clear that after a compaction the raw size of the file is around 305MB.
> >
> >> Allowing a database to grow to such a size might be considered a user error, or a CMS error. In any case, setups should be tested at the desired (at least simulated) scale, including upgrades, before deploying in a production environment, to not run into such issues unexpectedly.
> >
> > I do not disagree with this; part of the problem is that the ovn-ctl script currently defaults to attempting a schema upgrade on startup, which means that this procedure is executed whenever a user upgrades packages.
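For anyone running into this before a proper fix lands, the two stop-gaps mentioned above can be applied roughly as follows. The socket and file paths here are illustrative and depend on packaging, and, as far as I recall, cluster/change-election-timer only accepts up to a doubling of the current value per call, so a large increase has to be done in steps. Timing an offline conversion of a standalone copy only approximates the in-memory conversion cost, but it gives a feel for whether the chosen election timer is large enough:

    # Check cluster state and the current election timer (run against the
    # leader; the control socket path depends on how OVN is installed).
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

    # Raise the election timer (milliseconds).  Repeat the call if the
    # target is more than double the current value.
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 30000

    # Rough offline estimate of the conversion cost: make a standalone copy
    # of the clustered DB and time a conversion of it against the new schema.
    ovsdb-tool cluster-to-standalone /tmp/sb-standalone.db /var/lib/ovn/ovnsb_db.db
    time ovsdb-tool convert /tmp/sb-standalone.db /usr/share/ovn/ovn-sb.ovsschema /tmp/sb-converted.db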
> >
> > The lack of visibility of the failed upgrade, and the severe side effects a failed upgrade leads to, make me think we may need to change this.
> >
> > The reality is, sadly, that few users do full-scale tests/simulations of their deployment to check whether the next upgrade would succeed.
> >
> >> Another way out of the situation, besides bumping the election timer, might be to pin ovn-controllers, destroy the database (maybe keeping port bindings, etc.) and let northd re-create it after conversion. Not sure if that will actually work though, as I didn't try.
> >
> > I believe that for the incident which raised this topic to our attention, the user was unblocked by emptying the southbound DB and letting northd re-create it.
> >
> >> For the threads, I'll re-iterate my thought that throwing more cores at the problem is absolutely the last thing we should do. Only if there is no other choice. Simply because many parts of ovsdb-server were never optimized for performance, and there are likely many things we can do to improve without blindly using more resources and increasing the code complexity by adding threads.
> >
> > I admire your persistence in upholding this standard, and you are right, we should absolutely try every avenue before adding bloat and complexity to OVS/OVN. Thank you for keeping the bar high! :)
> >
> > The main reason for thinking about a separate thread here was to avoid the DB cluster leader disappearing from the cluster until the client attempting the conversion disconnects. For the operator, the underlying reason for this happening is non-obvious. I guess we could handle that part with better messaging.
> >
> >> Speaking of suboptimal code, the conversion process seems very inefficient. Let's deconstruct it. (I'll skip the standalone case, focusing on the clustered mode.)
> >>
> >> There are a few main steps:
> >>
> >> 1. ovsdb_convert() - Creates a copy of the database, converting each column along the way, and checks the constraints.
> >>
> >> 2. ovsdb_to_txn_json() - Converts the new database into a transaction JSON object.
> >>
> >> 3. ovsdb_txn_propose_schema_change() - Writes the new schema and the transaction JSON to the storage (RAFT).
> >>
> >> 4. ovsdb_destroy() - The copy of the database is destroyed.
> >>
> >> -----
> >>
> >> 5. read_db()/parse_txn() - Reads the new schema and the transaction JSON from the storage, replaces the current database with an empty one and replays the transaction that creates a new converted database.
> >>
> >> There is always a storage run between steps 4 and 5, so we generally only care that steps 1-4 and step 5 are each, separately, below the election timer threshold.
> >>
> >> Now, looking closer at step 1, which is the most time-consuming step: it has two stages - data conversion and the transaction check. The data conversion part makes sure that we're creating all the rows in the new database with all the new columns and without removed columns. It also makes sure that all the datum objects are converted from the old column type to the new column type by calling ovsdb_datum_convert() for every one of them.
> >>
> >> Datum conversion is a very heavy operation, because it involves converting the datum to JSON and back. However, in the vast majority of cases column types do not change at all, and even if they do, it only happens for a few columns, not for all of them.
> >>
> >> This part can be optimized by not converting columns if the type didn't change, and just creating a shallow copy of the datum with ovsdb_datum_clone() instead.
> >>
> >> In my preliminary testing this saves about 70% of the time spent in ovsdb_convert() and reduces the overall conversion time by about 30%. Creation of a shallow copy also speeds up the database destruction at step 4 and saves some memory.
> >>
> >> I'll post the patches I have for this a bit later, after cleaning them up.
> >
> > This would indeed be a good improvement; the patches [0] look great.
> >
> > I took them for a spin on the problem database, and I unfortunately did not see the same level of improvement for conversion time there. Looking more deeply into why, I see there are more than 700k records in the Logical_Flow table; more than 600k of those rows are for the `lr_in_arp_resolve` stage, with repeating rows for around 1500 unique MAC addresses. The Logical_Flow table has 4 columns that have changed type since 20.03, but I guess that fact is probably less important than the sheer number of records present.
> >
> > As previously stated, the size of this deployment is not huge. I'll try to gain access to the NB DB too, and see if this is some sort of logical flow explosion bug that has already been fixed. If that is the case, it would also feed into a need for changing how schema upgrades are handled.
> >
> >> The next part of ovsdb_convert() is ovsdb_txn_replay_commit(), which we can't really get rid of, because we have to perform all the transaction checks, including the referential integrity checks that take up most of the time. We might look at further optimizations for the weak reference re-assessment process in the future though, as it seems to be the heaviest part.
> >>
> >> Now to steps 2 and 3. Creation of a transaction JSON and writing it to the storage seems to be completely redundant. I understand that it was probably done this way because conversion to JSON and back was cheaper than ovsdb_convert(). However, with the change for step 1, that will no longer be the case.
> >>
> >> Database conversion is a deterministic process. All we need to perform it is the database content, which is already in the storage, and the new schema. So, we should be able to drop step 2 and write only the new schema to the storage at step 3. At step 5 we'll need to read the schema and call ovsdb_convert() again, but with the above optimizations that should not be heavier than parsing the transaction from JSON and replaying it. RAFT will ensure that we're converting the same data as in the first ovsdb_convert().
> >>
> >> Writing only the new schema to the storage may mean a backward-incompatible database file format change, but that should not be very hard to handle, taking into account that it only happens during conversion and the next compaction will get rid of the incompatibility.
> >>
> >> I'll sketch up some patches for this as well. I didn't try it yet, so I don't have any performance data.
> >>
> >> All in all, the conversion process will be condensed to just two calls to the optimized ovsdb_convert() and two database_destroy() calls: one before writing to storage and one after reading from it. This should be enough to cover most of the use cases, I hope.
> >>
> >> Thoughts?
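To make the shallow-copy optimization concrete for readers following along, here is a small self-contained sketch of the pattern. The structs and helper names below are toy stand-ins invented for illustration, not the real ovsdb_datum/ovsdb_type API; the only point is the guard: take a cheap clone when the column type is unchanged, and fall back to the expensive convert path only when the type actually changed.

    /* Toy illustration of "shallow copy instead of full conversion".
     * Simplified stand-ins, NOT the real ovsdb_datum/ovsdb_type API. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct toy_type  { const char *kind; };    /* e.g. "string", "integer" */
    struct toy_datum {
        int refcnt;                            /* shared, refcounted payload */
        char *value;                           /* stands in for the real datum */
    };

    static struct toy_datum *
    toy_datum_create(const char *value)
    {
        struct toy_datum *d = malloc(sizeof *d);
        d->refcnt = 1;
        d->value = strdup(value);
        return d;
    }

    /* Heavy path: stands in for ovsdb_datum_convert(), which serializes the
     * datum to JSON and parses it back under the new column type. */
    static struct toy_datum *
    toy_datum_convert(const struct toy_datum *old, const struct toy_type *new_type)
    {
        (void) new_type;                       /* type-specific parsing omitted */
        return toy_datum_create(old->value);   /* deep copy ~ JSON round trip */
    }

    /* Cheap path: stands in for ovsdb_datum_clone() as a shallow copy. */
    static struct toy_datum *
    toy_datum_clone(struct toy_datum *old)
    {
        old->refcnt++;
        return old;
    }

    static struct toy_datum *
    convert_column(struct toy_datum *old,
                   const struct toy_type *old_type,
                   const struct toy_type *new_type)
    {
        /* The optimization: only pay for a real conversion when the column
         * type actually changed between the old and the new schema. */
        if (!strcmp(old_type->kind, new_type->kind)) {
            return toy_datum_clone(old);
        }
        return toy_datum_convert(old, new_type);
    }

    int
    main(void)
    {
        struct toy_type str = { "string" }, integer = { "integer" };
        struct toy_datum *d = toy_datum_create("42");

        struct toy_datum *same = convert_column(d, &str, &str);
        struct toy_datum *diff = convert_column(d, &str, &integer);

        printf("unchanged type -> shallow copy shared: %s (refcnt=%d)\n",
               same == d ? "yes" : "no", d->refcnt);
        printf("changed type   -> real conversion:     %s\n",
               diff != d ? "yes" : "no");

        free(diff->value);
        free(diff);
        free(d->value);
        free(d);
        return 0;
    }

In the real code the win comes from avoiding the JSON serialize/parse round trip for every value in the database; by comparison, the shallow copy is essentially just bookkeeping, which lines up with the savings reported above.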
> >
> > Thanks again for the comprehensive explanation and suggestions for performance improvements. The performance increase you've seen from skipping unnecessary JSON conversions [0] is impressive.
> >
> > However, I'm not sure we can optimize away the entirety of the problem. For the end user, it is currently quite hard to identify what the problem is, and how to deal with it, when the schema upgrade fails. I'll have a look at this part of the equation too and propose some patches.
> >
> > 0: https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
>
> Just posted a patch set that should address the remaining parts of the proposed solution:
>
> https://patchwork.ozlabs.org/project/openvswitch/cover/[email protected]/
>
> I also uncovered a few other issues along the way and tried to fix most of them as well.
That's great, thank you for working on this. I'll take a look and throw some real-life DBs from past problematic upgrades at it.

--
Frode Nordahl

> Best regards, Ilya Maximets.
