On 7/30/25 12:04 PM, Felix Huettner via dev wrote:
> The ovsdb database is regularly compacted once it exceeds a certain amount
> of data growth. The compaction reduces the size of the database by bundling
> all already committed transactions into a single large initial transaction.
> This allows the DB to get rid of all changes that happened to the data in
> the meantime.
> 
> To ensure consistency, only a part of the compaction could happen
> asynchronously. Some parts (like writing the actual new file) needed to
> happen afterwards in the main thread to ensure no new transactions would
> interfere with the state. As this takes some time (especially for larger
> databases), in a raft cluster the leader would give up its leadership to
> ensure transactions can still be processed.
> 
> However, nothing guarantees that the new leader will not do a compaction a
> few seconds later as well and pass the leadership around again. This means
> that in the worst case, for an N node cluster, there could be N leadership
> changes. The average number of leadership changes would probably be N/2.

IIRC, Terry had an idea to add a cool-down period for leadership transfers
related to compactions, i.e. to block the transfer if the server became the
leader only recently.  This may be a good solution for multiple transfers
happening back to back.  Maybe the randomization of the snapshot scheduling
can also be improved, though it is already supposed to be decent.
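
Roughly, the cool-down could look like the sketch below.  The structure and
field names are made up for illustration, they are not the actual raft.c
code, and the cool-down value is an arbitrary assumption:

#include <stdbool.h>

/* Hypothetical state; the real fields in ovsdb/raft.c are different. */
struct raft_sketch {
    bool is_leader;
    long long election_won_msec;   /* When this server became the leader. */
};

#define COMPACTION_TRANSFER_COOLDOWN_MSEC (2 * 60 * 1000)   /* Assumption. */

/* Only transfer leadership before a compaction if this server has been
 * the leader for a while, so that back-to-back compactions on different
 * servers cannot bounce leadership around the whole cluster. */
static bool
may_transfer_for_compaction(const struct raft_sketch *r, long long now_msec)
{
    return r->is_leader
           && now_msec - r->election_won_msec
              >= COMPACTION_TRANSFER_COOLDOWN_MSEC;
}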

> Such leadership changes are not free. The raft cluster is unable to process
> transactions until a new leader is elected (which is generally quite fast in
> leadership transfers). In addition all clients that need to connect to the
> leader will need to search for the new leader and reconnect to it.

In general, the only process that needs a leader-only connection in a typical
OVN setup is northd, because it requires locking.  Everything else should be
able to connect to followers just fine.  Performance of sending transactions
through a follower is not much different from sending through a leader in 3.4+.
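
For reference, a C client based on the IDL can opt out of leader-only
connections with the existing ovsdb_idl_set_leader_only() knob.  The snippet
below is just a sketch; the remote and the schema class are placeholders
provided by the caller:

#include "ovsdb-idl.h"

/* Sketch: create an IDL connection that is happy to stay on a follower,
 * so it is not disturbed by leadership transfers. */
struct ovsdb_idl *
create_follower_friendly_idl(const char *remote,
                             const struct ovsdb_idl_class *class)
{
    struct ovsdb_idl *idl = ovsdb_idl_create(remote, class, true, true);
    ovsdb_idl_set_leader_only(idl, false);
    return idl;
}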

I wonder if we can somehow make the locking mechanism distributed, but it
seems very complicated.

> The idea of this patch series is to attempt to minimize the time a compaction
> blocks the main thread. With the compaction time small enough, there is no
> need anymore to change the raft leadership. In addition, this minimizes the
> downtime on non-raft systems as well.
> 
> To solve the consistency needs, we split the ovsdb database into multiple
> files. Only one of these files is ever written to for new transactions. When
> doing a compaction, only the other files, which are not written to, are
> compacted. Since they are no longer written to, there is no potential
> consistency issue anymore, and we have an immutable state for the compaction
> to run on.
> 
> In normal operation we therefore have 2 files:
> 1. The base file. This is also what is passed in as the file name in the CLI.
>    It is immutable and generally contains the last compacted state, or the
>    initial state for a new cluster. It points to the second file by using
>    log metadata.
> 2. The current file. All operations are written to this file just as normal.
> 
> During compaction we can now:
> 1. Create a new file to write new transactions to.
> 2. Write a pointer into the current file that points to the new file.
> 3. Open a temporary compaction file.
> 4. Spawn the asynchronous compaction to operate on the current state, i.e.
>    everything in the base and current file. Changes to the new file will be
>    ignored by the compaction. The compaction writes the whole output to the
>    temporary file.
> 5. While the compaction runs, process transactions normally by writing them
>    to the new file.
> 6. When the compaction completes:
>     a. Write a pointer to the new file into the temporary compaction file.
>     b. Rename the temporary compaction file to the base file (atomically).
>     c. The temporary compaction file is now the new base file.
>     d. The new file is now the current file.
>     e. We can get rid of the previous base and current files as they are no
>        longer necessary.
>     
> With this, the main thread at most needs to write pointers to new files and
> rename files, while all the work is done in the async compaction thread. This
> works for raft as well as for a standalone ovsdb.
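
To make the chaining concrete, here is a self-contained toy (plain stdio,
with invented file names and an invented record format; the real series
stores the pointer as OVSDB log metadata) that mimics the pointer-record
plus atomic-rename pattern described above:

#include <stdio.h>

/* Append a "pointer" record to 'file' that names the next file in the
 * chain.  The JSON-ish format here is made up for illustration. */
static void
append_pointer_record(const char *file, const char *next_file)
{
    FILE *f = fopen(file, "a");
    if (f) {
        fprintf(f, "{\"next_file\": \"%s\"}\n", next_file);
        fclose(f);
    }
}

int
main(void)
{
    /* Steps 1-2: start a new log for fresh transactions and chain the
     * current log to it. */
    append_pointer_record("db.current", "db.new");

    /* Steps 3-5 happen elsewhere: the compaction thread writes the
     * compacted base + current state into "db.tmp" while the main
     * thread keeps appending transactions to "db.new". */

    /* Step 6: chain the compacted file to the new log and atomically
     * replace the base file with it; the old files can then be dropped. */
    append_pointer_record("db.tmp", "db.new");
    rename("db.tmp", "db.base");
    return 0;
}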

Note: nothing prevents us from performing compaction today in a separate
thread for standalone databases, we just need to remember the file offsets
instead of transaction IDs and then copy the newly added data from the old
file to the new one.  This should not be too expensive.
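
Something along these lines (a sketch with error handling trimmed), copying
everything appended after a remembered offset from the old file into the
new one:

#include <stdio.h>

/* Copy the tail of 'old_path' starting at 'offset' (the position that
 * was remembered when the compaction was started) to 'new_path'. */
static int
copy_tail(const char *old_path, long offset, const char *new_path)
{
    FILE *src = fopen(old_path, "rb");
    FILE *dst = fopen(new_path, "ab");
    char buf[8192];
    size_t n;
    int ok = src && dst && !fseek(src, offset, SEEK_SET);

    while (ok && (n = fread(buf, 1, sizeof buf, src)) > 0) {
        ok = fwrite(buf, 1, n, dst) == n;
    }
    if (src) {
        fclose(src);
    }
    if (dst) {
        fclose(dst);
    }
    return ok;
}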

> This also means that leader
> changes are no longer necessary.

I don't think I agree on that.  Your testing numbers below still show more
than a second where no transaction can be committed.  When I initially
implemented the compaction thread I had similar results with my testing
(1+ second of blockage, though on a slightly smaller database) and that was
the reason to keep the transfer.

> 
> Note that this does not affect raft snapshot installation in any way. That
> still blocks the main thread for some time. In addition, snapshot
> installation is treated as more important than compaction, and compaction
> will be aborted if a snapshot installation is requested.
>     
> I tested the performance improvements with an 833 MB database (683 MB after
> compaction) as reported by the log output (compiled with -O2).
> 
> |                         | Before Change | After Change |
> |-------------------------|---------------|--------------|
> | Init                    | 1015ms        | 1046ms       |
> | Write/Commit            | 2869ms        | 137ms        |
> | Total Compaction Thread | 29407ms       | 32104ms      |
> | Total Main Thread       | 3884ms        | 1183ms       |
> 
> The 1 second in "Init" is nearly completely due to cloning the database for
> the compaction thread. If someone has an idea how to optimize this even
> further, I would be quite interested.
> 
> But from my perspective, blocking the main thread for ~1.2 s is preferable to
> a raft leader change, since the latter causes a longer downtime for all
> connected services.

1 second is still a lot.  The vast majority of connected services should not
be using leader-only connections (it may be good to change the default).  The
reconnection itself should also be done without a backoff, so it should not
take too long either.

There are a few issues with the implementation of this set as well:

- Cirrus CI is failing because some tests are using bash.  A few ovsdb log
  tests are also failing for some other reason, but I didn't look too deep.

- We can't destroy the database clone or the schema from the thread.
  The database is a shallow copy of the data and all of the reference
  counting is not atomic.  Making it atomic would not be an easy task,
  as it would require either un-inlining certain functions or exposing
  entire compiler/ovs-atomic libraries, which is not good (see the first
  sketch after this list for what that would roughly involve).  Also,
  making the counters atomic may cause a noticeable performance impact
  on systems with weaker memory ordering.

  It's also dangerous to allow the thread to use the storage/raft modules, as
  we can easily lose track of which parts of these modules are safe and which
  are not.  I didn't go through all the paths in this set, but a requirement
  to always keep this in mind when working on ovsdb changes doesn't sound
  particularly exciting...

- pthread_kill() only ensures that the signal handler will be executed in
  the context of the target thread.  But SIGKILL will still kill the entire
  process; there is no way to kill a single thread.  The termination has
  to be done cooperatively, with a proper return from the thread function
  by the thread itself (see the second sketch below).
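
First sketch: what making the reference counting atomic would roughly mean,
shown with plain C11 atomics just for illustration (OVS has its own
ovs-atomic layer, which is exactly the part that would need to be exposed or
un-inlined; the structure below is a stand-in, not actual ovsdb code):

#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for the shared, reference-counted data. */
struct shared_sketch {
    atomic_size_t ref_cnt;
    /* ... payload ... */
};

static void
shared_ref(struct shared_sketch *s)
{
    atomic_fetch_add_explicit(&s->ref_cnt, 1, memory_order_relaxed);
}

static void
shared_unref(struct shared_sketch *s)
{
    if (atomic_fetch_sub_explicit(&s->ref_cnt, 1,
                                  memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        free(s);   /* Last reference gone; safe to destroy. */
    }
}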
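
Second sketch: cooperative termination, where the main thread only sets a
flag and the compaction thread checks it periodically and returns on its own
(names are invented, not the series' actual code):

#include <pthread.h>
#include <stdatomic.h>

static atomic_bool compaction_abort_requested;

static void *
compaction_thread_main(void *arg)
{
    (void) arg;
    for (;;) {
        if (atomic_load_explicit(&compaction_abort_requested,
                                 memory_order_acquire)) {
            break;                /* Voluntary, clean exit. */
        }
        /* ... do one bounded chunk of compaction work; leave the loop
         * normally once the compaction is finished ... */
        break;
    }
    return NULL;
}

/* Called from the main thread, e.g. when a snapshot installation is
 * requested and the running compaction has to be aborted. */
static void
compaction_abort(pthread_t tid)
{
    atomic_store_explicit(&compaction_abort_requested, true,
                          memory_order_release);
    pthread_join(tid, NULL);      /* Wait for the thread to return. */
}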

Best regards, Ilya Maximets.