Kristian, Sergey, hello. While I'm still studying the Mariadb BGC (binlog group commit) let me jump in with few comments.
> Sergey Petrunia <[email protected]> writes: > >> == Some background == >> >> "group commit with binlog" feature needs to accomplish two goals: >> >> 1. Keep the binlog and the storage engine in sync. >> storage_engine->prepare(sync=true); >> binlog->write(sync=true); >> storage_engine->commit(); >> >> 2. The second goal is to make operation performant. We need two coordinated >> disk > >> == Group Commit with Binlog in MySQL == >> >> MySQL (and fb/mysql-5.6 in particular) does in the following phases: >> >> Phase #1: >> Call storage_engine->prepare() for all transactions in the group. >> The call itself is not persistent. >> >> Phase #2: Call storage->engine->flush_logs(). >> This makes the effect of all Prepare operations from Phase#1 persistent. >> >> Phase #3: >> Write and sync the binary log. >> >> Phase #4: >> Call storage_engine->commit(). This does not need to be persistent. > > Interesting. Phase #2 is a mysql 5.7 feature, it is not in 5.6. Did Facebook > backport this to their 5.6 tree? Or did MySQL 5.7 get this from Facebook > work? Facebook contributed, indeed. Bug #73202 write/sync redo log before flush thread cache to binlog Submitted: 5 Jul 2014 0:31 Modified: 24 Nov 2014 2:01 Reporter: zhai weixiang (OCA) > >> MariaDB does not have these phases described above: >> >>> Phase #1: >>> Call storage_engine->prepare() for all transactions in the group. >>> The call itself is not persistent. >>> >>> Phase #2: Call storage->engine->flush_logs(). >>> This makes the effect of all Prepare operations from Phase#1 persistent. > > Right, it combines them in a single "phase". storage_engine->prepare() is > expected to be persistent and do its own group prepare. While MySQL 5.7 > builds a list of transactions to group prepare. > >> RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc, >> DBImpl::WriteImpl function) handles both Prepare() and Commit() commands >> and does the following: >> >> 1. Controls writing the commited data into the MemTable >> 2. Writes transactions to WAL >> 3. Syncs the WAL. > > Can you explain, at a high level, how RocksDB transaction visibility, lock > release, and persistency works? > > Is it like - once a transaction is written to the MemTable, it is visible to > other transactions and its commit order is determined wrt. other > transactions? > > And persistency is guaranteed after write+sync of the WAL? > > When are locks released in this sequence? > >> All three steps are done for the whole group. This has a consequence: a >> Commit() operation that does not need to sync the WAL will still be delayed >> if another operation in the group needs the WAL to be synced. > > So do I understand correctly: DbImpl::WriteImpl() does both group > commit Sergey definitely needed to spell it out. One may even think of a SELECT that suffers delay.. A fair guess? > (in-memory) to the MemTable, as well as group commit (on disk) to the WAL? > And it uses the _same_ grouping of transactions for these two operations? > > And so, the first commit_ordered() joins some prepare() that wants to sync > the WAL. And only once that sync is done can the next commit_ordered() start > - and it might easily end up joining a thread that recently completed > its delayed commit_ordered() and is now doing prepare for a new transaction. > > Indeed, this is not acceptable performance-wise for commit_ordered(). It > must not wait for disk operations. > > So why is this not a problem in the MySQL case? MySQL runs the > handlerton->commit() calls under LOCK_commit just like MariaDB does > commit_ordered(). Sounds like 5.7 (5.6-fb) flush_logs() by the BGC leader triggers necessary group-prepare coordination. Specifically, I would expect (yet to look at the Sergey's branch) one fsync() for the whole group by the leader. > > My guess is that this is because in your tests, WriteImpl() was _always_ > called with WAL sync disabled. I wonder what would happen if you were to run > a mix of binlogged and non-binlogged (SET sql_log_bin=0) transactions, where > the latter would end up in WriteImpl(sync=true); maybe a similar problem > would occur. > > > So I think there is something interesting to look at here. If I understand > correctly, WriteImpl() tries to reduce contention between threads doing > commit (even in-memory commit) by making them group up and having a single > thread commit for multiple transactions, rather than jumping from one thread > to another for each commit. > > This is a valid technique, but it seems to fit badly with what MySQL and > MariaDB is doing. In MariaDB, commits are _already_ grouped and done from a > single thread. In MySQL, _both_ prepare and commits are so grouped from a > single thread (though I think one thread can do group prepare in parallel > with another doing group commit). > > So there seems to be an opportunity to simplify WriteImpl() for the MySQL > and MariaDB binlog case. If my understanding is correct, there will not be a > lot of writer grouping. > > (Of course for the non-binlog case, the situation is different). > > Maybe this comment is relevant here? > > // Requesting sync with concurrent_prepare_ is expected to be very rare. > We > // hance provide a simple implementation that is not necessarily > efficient. > > >> == Possible solutions == >> >> I am not sure what to do. >> >> - Make the SQL layer's Group Commit implementation invoke hton->flush_logs() >> explicitly, like MySQL does? > > That is an option, though I hope you will not do it like that. What does > flush_logs() have to do with making prepare() persistent? > > You could instead add a new handlerton method group_prepare() or something. > If non-NULL, the storage engine may omit persistency in prepare(), but must > then in group_prepare() ensure that all prepares are persistent that have > completed prepare_ordered(). And if the method is non-null, the SQL layer > will call group_prepare() just before binlog write (under a new mutex > LOCK_group_prepare that is chained before LOCK_log. > > This way, the extra lock can be avoided for storage engines that do not need > group_prepare(). And storage engines have freedom to implement > group_prepare() in a way that suites them. > > Of course, Rocksdb can just implement group_prepare() as flush_logs() to > make _all_ prepares persistent, just like in MySQL. So for rocksdb the > functionality is identical to in MySQL, while flexibility is preserved for > other storage engines. Indeed. > > However, it still seems to me that there is opportunity to do better here. > For example, the upper layer could present to Rocksdb the actual list of > transactions that need to be group prepared / group committed. Then Rocksdb > could do them in a single writer without having to coordinate the threads > manually in WriteImpl(). > >> - Modify RocksDB so that Transaction::Commit(sync=false) do not use Group >> Write? I am not sure if this is possible: Group Write is not about only >> performance, it's about preventing concurrent MemTable writes. >> AFAIU one cannot just tell a certain DBImpl::WriteImpl() call to not >> participate in write groups and work as if there was no other activity. > > What about if rocksdb got the list of transactions to commit > (to memtable, sync=false) explicitly, rather than as individual commit() or > commit_ordered() calls? Then it could commit them all in a single writer, > which should be more efficient. And similar for prepare? > > In the current MySQL (facebook patch) code, isn't it the case that each > commit() has to create a new writer and write to memtable a single commit > individually? While all of these calls are in fact from a single thread from > an explicit, known list. This does not seem optimal. > >> - Modify RocksDB so that Transaction::Commit(sync=false) does not wait until >> its write group finishes WAL sync? This could be doable but is potentially >> complex. > > That should probably be done so that a write group would only write to the > memtable (and the in-memory WAL buffer?). After that, it would release all > non-syncing participants, and the remaining syncing participants could form > a new write group to do the sync independently. > > Of course, if a write group rarely syncs, this is of little benefit. From my > limited understanding of the code, flush_logs() which ends up in SyncWAL() > does not use a write group. > > > So this already ended up as a huge email, but I thought some background on > commit_ordered() could also help here. Note that in MySQL >=5.6, their > commit() is very similar to commit_ordered(). > > commit_ordered exists for three main purposes, if I recall correctly: > > 1. To synchronise commit order in binlog and storage engine. This ensures > that if a physical backup is taken of the storage engine and used to > provision a slave, the storage engine state corresponds to a unique point > in the binlog (MySQL has this). > > 2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly > synchronise snapshots between multiple storage engines (MySQL does not have > this, I think). (Offtopic, but anyway what it is? Multi-engine transaction with this specific ISOLATION level?) > > 3. To avoid having to do an extra fsync() for every commit, on top of the > one for prepare and the one for binlog write (MySQL has this). > > I think those are the main reason for commit_ordered() (I might have > forgotten some). > > For this problem, I suppose (3) is the main interest? > > MySQL handles (3) by stopping all transactions around binlog rotate and > doing a flush_logs(). Maybe I am missing something, but why binlog rotation? It is not a common case. Indeed MySQL BGC reduces the number of fsync() to two by the (flush stage) leader. As to rotation, it's a specific branch of MYSQL_BIN_LOG::ordered_commit() where a rotator thread contends for the flush stage mutex, eventually gains it (which may lead to few more groups binlogged into the old being rotated file) and performs. > It needs this because after binlog rotation, binlog > crash recovery has only an empty binlog, so _all_ transactions must be > durably committed at this point. > > MariaDB avoids this stall You must be having a use case which I can't see.. > around binlog rotate. Instead it extends binlog > crash recovery to be able to look into multiple binlog files. So there is no > need to force commits to disk around binlog rotate. > > To eventually be able to drop binlog files, there is the > binlog_checkpoint_request() mechanism. This allows the storage engine to > inform the upper layer when all the transactions in a binlog have ended up > durably committed, in the normal course of action of the storage engine. > > So to just get (3), RocksDB could just implement no commit_ordered(), or > perhaps an empty commit_ordered(). And then also implement > binlog_checkpoint_request() to record the latest prepared transaction at > that point - and when the WAL is later synced, reply back to the upper layer > to allow it to release the old binlog file. This seems doable without > support for quickly committing a transaction to memory, which current > RocksDB WriteImpl() seems poor at doing simultaneously with persistent > prepare(). > > Though I think it would be good if the full functionality of > commit_ordered() was implemented in RocksDB. Passing down into RocksDB > explicitly the list of transactions to group-prepare or group-commit sounds > like an interesting idea that could potentially benefit performance. > > Thoughts? Let me reserve my night for more :-). > > Hope this helps, I wanted to present some background on this feature. Please > let me know of any details you want explained or discussed, and I will try > to answer them briefly and to the point. > > - Kristian. Cheers, Andrei _______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : [email protected] Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp

