I am implementing Binlog GTID Indexes, to fix an old performance regression
for GTID when slaves connect to the master. I now have a solid design and a
working prototype, so I wanted to describe the work to encourage early
comments and suggestions.

I'm hoping this will make it to the next release (where is the place to
actively follow/participate in the release planning?)

Ever since I implemented GTID, connecting slaves needed to sequentially scan
the leading portion of one binlog file to locate the GTID position to start
at. This can be slow, as binlog files default to 1GB in size, and it is even
slower now that binlog files can be encrypted. I filed MDEV-4991 for this 10
years ago, and it is embarrassing that this has somehow remained unfixed for
so long :-/ But now I'm fixing it.

The current implementation is in the work-in-progress branch
knielsen_mdev4991:

  https://github.com/MariaDB/server/tree/knielsen_mdev4991

The code is already functional, with the testsuite passing. Still missing
are binlog purge, index crash recovery, the async path, and general testing
and cleanup.

The basic idea is for each binlog file master-bin.000001 to write an index
file master-bin.000001.idx. The index contains a B+-Tree in which the keys
are pairs of (GTID state, binlog offset). A connecting slave's GTID position
can be looked up quickly in the tree to find the corresponding binlog offset
(and gtid_binlog_state) to start replicating at. Similarly, a non-GTID
connecting slave can quickly look up its starting binlog offset in the index
to obtain the corresponding GTID position (as BINLOG_GTID_POS() does).
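
To illustrate the lookup, here is a minimal Python sketch (not the actual
server code). It assumes a GTID state can be simplified to a map from domain
id to highest sequence number, and that index records are kept in binlog
order; the names covered() and find_start() are made up for this example.
Because the state only grows along the binlog, "is this record already
covered by the slave position" is true for a prefix of the records, so a
binary search finds the last such record.

```python
# Hypothetical in-memory stand-in for the index: records in binlog order.
# Each record is (gtid_state, binlog_offset), where gtid_state maps
# domain_id -> highest seq_no binlogged so far in that domain.
# (Real GTIDs also carry a server_id; that is ignored in this sketch.)

def covered(state, slave_pos):
    """True if the slave has already applied everything in `state`."""
    return all(slave_pos.get(domain, 0) >= seq_no
               for domain, seq_no in state.items())


def find_start(records, slave_pos):
    """Return the last record whose state is covered by slave_pos, or None."""
    lo, hi = 0, len(records)  # records[:lo] are covered, records[hi:] are not
    while lo < hi:
        mid = (lo + hi) // 2
        if covered(records[mid][0], slave_pos):
            lo = mid + 1
        else:
            hi = mid
    return records[lo - 1] if lo > 0 else None


records = [
    ({0: 10}, 256),
    ({0: 20}, 4096),
    ({0: 20, 1: 5}, 9000),
    ({0: 30, 1: 5}, 15000),
]
start = find_start(records, {0: 25, 1: 5})
```

From the record found, the slave would then scan forward in the binlog file
to reach its exact position.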

The index is written out to disk concurrently with the writing of the binlog
file. Since the tree is written in sorted order and append-only, the
implementation is significantly simplified over a general B+-Tree, and can
also be described as a Log-Structured Merge Tree with a B-Tree search structure
on top.
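
As a rough illustration of why sorted, append-only writing simplifies
things, here is a small Python sketch of such a tree. It is not the
prototype's code: the class AppendOnlyIndex, the tiny PAGE_CAPACITY, and
the use of plain integers as keys (standing in for GTID states) are all
assumptions made for the example. Pages are only ever appended, a full page
is never split or rewritten, and the last page written is the root.

```python
from bisect import bisect_right

PAGE_CAPACITY = 4  # hypothetical; tiny so the example stays readable


class AppendOnlyIndex:
    def __init__(self):
        self.pages = []         # stands in for the file; pages only appended
        self.open_nodes = [[]]  # one partially filled node per level, 0 = leaf

    def append(self, key, offset):
        """Keys must arrive in increasing order, as they do in a binlog."""
        self._add(0, (key, offset))

    def _add(self, level, entry):
        if level == len(self.open_nodes):
            self.open_nodes.append([])
        node = self.open_nodes[level]
        node.append(entry)
        if len(node) == PAGE_CAPACITY:
            self._write(level, propagate=True)

    def _write(self, level, propagate):
        node = self.open_nodes[level]
        self.open_nodes[level] = []
        page_no = len(self.pages)
        self.pages.append((level, node))
        if propagate:
            # The parent remembers the first key of the page just written.
            self._add(level + 1, (node[0][0], page_no))

    def close(self):
        """Flush partial nodes bottom-up; the last page written is the root."""
        level = 0
        while True:
            is_top = level == len(self.open_nodes) - 1
            if self.open_nodes[level]:
                self._write(level, propagate=not is_top)
            if is_top:
                break
            level += 1

    def lookup(self, key):
        """Return the offset stored for the largest indexed key <= key."""
        if not self.pages:
            return None
        level, node = self.pages[-1]  # start at the root
        while True:
            i = bisect_right([k for k, _ in node], key) - 1
            if i < 0:
                return None
            if level == 0:
                return node[i][1]
            level, node = self.pages[node[i][1]]  # descend to child page


index = AppendOnlyIndex()
for key in range(10, 101, 10):
    index.append(key, key * 100)
index.close()
```

Here index.lookup(55) descends from the root to the leaf holding key 50 and
returns the offset stored with it.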

All GTID state keys (except the first) in the index are delta-compressed,
storing only the GTIDs that changed since the last record (which typically
will be only one). Additionally the index is sparse, storing only, say, 1 in
10 GTIDs (this will be configurable of course). With a sparse index, disk
space is reduced at the cost of scanning only a few extra events of the
binlog file to find the exact position requested. Actual benchmarking of
space usage is TBD, but we can expect a typical binlog file of 1 GB
containing, say, 3 million GTIDs to require something like 10 MB of index
disk space (a 1% increase).
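
A small Python sketch of the delta-compression idea and of the space
estimate, under stated assumptions: GTID states are again simplified to a
map from domain id to sequence number, the helper names are invented for
this example, and the numbers are just the rough guesses from the paragraph
above, not measurements.

```python
def delta_encode(states):
    """Yield the first state in full, then only the domains that changed."""
    prev = None
    for state in states:
        if prev is None:
            yield dict(state)
        else:
            yield {d: s for d, s in state.items() if prev.get(d) != s}
        prev = state


def delta_decode(deltas):
    """Rebuild the full states by applying each delta to the previous one."""
    current = {}
    for delta in deltas:
        current = {**current, **delta}
        yield current


states = [{0: 1}, {0: 2}, {0: 2, 1: 1}, {0: 3, 1: 1}]
deltas = list(delta_encode(states))

# Back-of-the-envelope space estimate using the guesses from the text.
gtids_per_binlog = 3_000_000
sparseness = 10                      # index 1 in 10 GTIDs (assumed default)
index_bytes = 10 * 1024 * 1024       # "something like 10 MB"
binlog_bytes = 1024 * 1024 * 1024    # 1 GB binlog file
index_records = gtids_per_binlog // sparseness
bytes_per_record = index_bytes / index_records
overhead = index_bytes / binlog_bytes
```

With these guesses the index holds 300,000 records, so the estimate amounts
to a budget of roughly 35 bytes per record including page overhead.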

The file format is page-based (unlike the binlog file), so that it can be
written efficiently and easily read using random-access.
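
The point of fixed-size pages is that page N can be read with a single seek,
without scanning what comes before it. A minimal Python sketch, assuming a
hypothetical page size of 4096 bytes and zero padding (the actual page size
and layout of the .idx file are not specified here):

```python
import io

PAGE_SIZE = 4096  # hypothetical; the real page size may differ


def write_page(f, payload):
    """Append one page, zero-padded to the fixed page size."""
    if len(payload) > PAGE_SIZE:
        raise ValueError("payload does not fit in one page")
    f.write(payload.ljust(PAGE_SIZE, b"\0"))


def read_page(f, page_no):
    """Random-access read of page `page_no` (0-based)."""
    f.seek(page_no * PAGE_SIZE)
    data = f.read(PAGE_SIZE)
    if len(data) != PAGE_SIZE:
        raise EOFError("page %d is missing or truncated" % page_no)
    return data


f = io.BytesIO()  # stands in for an open index file
for text in (b"leaf 0", b"leaf 1", b"root"):
    write_page(f, text)
page = read_page(f, 2)
```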

The index is written in parallel with the binlog file, but no extra fsync()s
are needed (except one at the end of the file when the index is closed and
synced to disk). The writing of the index will happen asynchronously from
the binlog background thread; this way it will have minimal impact on the
performance and scalability of the binlog and the contested LOCK_log.
Connecting slaves can read from the currently-being-written ("hot") GTID
index by accessing internal memory buffers of pages not yet written to disk.

If the server crashes, the existing binlog recovery and binlog checkpoint
mechanism will be used to re-create any incomplete indexes as part of the
normal recovery scan of binlog files. If a binlog index should somehow be
found corrupt or missing (e.g. after upgrading from an old server, or when a
backup script omits the new .idx files), the code will gracefully fall back
to the old (slower) way of sequentially scanning the binlog file to locate
the slave starting position.
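
The fallback can be pictured roughly like this in Python. Everything here is
illustrative: IndexUnavailable, find_start_offset(), and the two callables
passed in are hypothetical names, not functions from the server.

```python
class IndexUnavailable(Exception):
    """Raised by the (hypothetical) index reader on a corrupt index."""


def find_start_offset(binlog_name, slave_pos, index_lookup, sequential_scan):
    """Prefer the index; fall back to scanning the binlog if it is unusable."""
    try:
        return index_lookup(binlog_name + ".idx", slave_pos)
    except (FileNotFoundError, IndexUnavailable):
        # Missing or corrupt index: use the old, slower sequential scan.
        return sequential_scan(binlog_name, slave_pos)


# Stub implementations, only to exercise the control flow.
def missing_index(idx_name, slave_pos):
    raise FileNotFoundError(idx_name)


def working_index(idx_name, slave_pos):
    return 9000


def slow_scan(binlog_name, slave_pos):
    return 9000


fast = find_start_offset("master-bin.000001", {0: 25}, working_index, slow_scan)
slow = find_start_offset("master-bin.000001", {0: 25}, missing_index, slow_scan)
```

Either path ends at the same starting offset; only the time taken to find it
differs.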

I think this will be a great (if rather late) improvement to one remaining
corner of the GTID implementation that currently has sub-optimal
performance, and I'm excited to get this completed, reviewed/tested, and
added to, hopefully, the next MariaDB release.

 - Kristian.
_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org