Hi,
On 12/4/24 18:08, Kristian Nielsen wrote:
> Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
>> On 12/4/24 13:19, Kristian Nielsen via developers wrote:
>>> 5. A more controversial thought is to drop support for semi-sync
>>> replication. I think many users use semi-sync believing it does something
>> As a (kind of) user of semi-sync replication, I believe it has a
>
> Hi Markus, thanks for taking the time to comment! Your input is very
> valuable.
>
>> valid, albeit limited, use-case and that it's a necessary component in
>> setups where no transactions are allowed to be lost when the primary
>> node in a replication cluster goes down. Perhaps I'm wrong or the way
>
> I would like to be explicit about what "no transactions are allowed to be
> lost" means. I know you, Markus, fully understand what it means, of course.
> Transactions can easily be lost if the server crashes up to and during the
> commit. What it really means is that the server will send a notification to
> the client at some point when a single point of failure will no longer cause
> the transaction to be lost. With semi-sync, this notification comes in the
> form of the "ok" result of the client's commit.
>
> I want to understand if there are other, possibly better ways to get this
> notification, if that is all the relevant applications need?
>
> I was suggesting that the application could itself use MASTER_GTID_WAIT()
> against a slave before accepting the commit as "ok" (or a proxy like
> MaxScale could do it for the application). Does the current semi-sync
> replication do anything more for the application than this, and if so, what?

I think that implementing semi-sync in each application is probably a bit
too much, but doing it in a proxy like MaxScale does sound doable, and the
implementation would be essentially the same: delay the OK for the commit
until at least one replica responds to the MASTER_GTID_WAIT(). The number
of roundtrips should be the same, so the only downside of this approach is
that you're forced to wait for the SQL thread to apply the transaction,
which introduces more latency than the existing semi-sync approach does. If
a function like MASTER_GTID_WAIT_FOR_IO_THREAD() were to exist, it would
probably be very close in terms of latency.
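
For illustration, a minimal sketch of the pattern I mean, with a made-up
GTID; MASTER_GTID_WAIT() returns 0 once the replica has applied the GTID
and -1 on timeout:

  -- On the master, run by the proxy after the client's COMMIT succeeds:
  SELECT @@gtid_binlog_pos;  -- e.g. "0-1-123"

  -- On a replica, before the proxy forwards the OK to the client:
  SELECT MASTER_GTID_WAIT('0-1-123', 10);  -- wait up to 10 seconds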

>> Another use-case that I think I heard about was to use semi-sync
>> replication to slow down the rate of writes so that replication lag is
>> avoided. While this is possible, I believe that tuning the group commit
>> size to be larger probably has the same effect with better overall
>> performance.
>
> One benefit of this method is that each commit can decide whether it needs
> to wait or not. One commit that "is not allowed to be lost" will not block
> other transactions from committing. I think with AFTER_SYNC, all following
> transactions will be blocked from committing until the current commit has
> been acknowledged by a slave, and that with AFTER_COMMIT they will not be
> blocked, but I'm not 100% sure.

I had a vague memory of the group commit mechanism doing only one ACK per
group but I might have remembered it wrong; I'm mostly a passive observer
to the replication-related discussions in Zulip and MDEVs. If it indeed
does one ACK per commit even when there's a group of transactions, then
doing it at the application level might perform better, as the waits could
be done in parallel.
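
For reference, the group commit tuning I mean is done with the binlog group
commit variables; a sketch with made-up values, not a recommendation:

  -- Delay the binlog fsync for up to 100ms, or until at least 20
  -- transactions have queued up, so more commits share one group commit.
  SET GLOBAL binlog_commit_wait_count = 20;
  SET GLOBAL binlog_commit_wait_usec = 100000;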

>> misunderstanding comes from this. The default value of
>> rpl_semi_sync_master_wait_point should be AFTER_SYNC (lossless
>> failover) and rpl_semi_sync_master_timeout should be set to something
>
> I would like to understand the reason(s) AFTER_SYNC is better than
> AFTER_COMMIT.
>
> From my understanding, from the client's narrow perspective about their own
> commit there is little difference: either is a notification that the
> transaction is now robust to a single point of failure (available on at
> least two servers).

Yes, I think you're right and from the point of view of the client the
configuration is irrelevant: if you get the OK for the commit the
transaction is "durable" on more than one server.

> I know of one use-case, which is when things are set up so that if the
> master crashes, failover to a slave is _always_ done, and the crashed
> master is changed to be a slave of the new master (as opposed to letting
> the master restart, do crash recovery, and continue its operation as a
> master).
>
> With AFTER_COMMIT, the old master might have a transaction committed that
> does not exist on the new master, which will prevent it from working as a
> slave, and it will need to be discarded (possibly restored from a backup).
>
> With AFTER_SYNC, the old master may still (after restarting) have a
> transaction committed to the binlog that is not on the slave / new master.
> But the old master can be restarted with --rpl-semi-sync-slave-enabled,
> which tries to truncate the binlog to discard as many transactions from it
> as possible, to make sure it only has transactions that are also present
> on the new master.

I think this is the use-case that MDEV-21117 and MDEV-33465 relate to. From
what I remember (in relation to MDEV-33465), having the master roll back
the transactions caused some problems if a quick restart happened. I think
it was that if GTID 0-1-123 gets replicated due to AFTER_SYNC but then the
master crashes and comes back up, it rolls back 0-1-123 due to
--rpl-semi-sync-slave-enabled (or --init-rpl-role=SLAVE after MDEV-33465),
but before replication starts back up, another transaction gets committed
as GTID 0-1-123 on the master. Now when the replication asks for "GTID
position after 0-1-123", instead of getting an "I have not seen that GTID"
error, the replication continues and history has effectively been
rewritten. I don't remember if this was the exact problem but it was
something along these lines. Looking at the description of --init-rpl-role
(https://mariadb.com/kb/en/mariadbd-options/#-init-rpl-role), it seems that
it can also cause replication to break.
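
To make the sequence concrete, here is my rough understanding of it as
made-up statements (whether the restarted master accepts the write at all
depends on the setup, e.g. read_only):

  -- Old master restarts and truncates the replicated-but-unacknowledged
  -- transaction, so its binlog state goes back one sequence number:
  SELECT @@gtid_binlog_pos;   -- "0-1-122" (0-1-123 was truncated)

  -- A local write on the old master now reuses the sequence number:
  INSERT INTO t1 VALUES (1);  -- binlogged as a different 0-1-123

  -- A server that already applied the original 0-1-123 reconnects:
  CHANGE MASTER TO master_use_gtid = slave_pos;
  START SLAVE;
  -- Replication resumes from "after 0-1-123" without an error, silently
  -- skipping the new, different 0-1-123: diverged history.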

> (Interestingly, this means that the purpose of AFTER_SYNC is to ensure that
> transactions _are_ lost, rather than ensure that they are _not_ lost).
>
> Is this the (only) reason that AFTER_SYNC should be default? Or do you know
> of other reasons to prefer it?

I think this is the only reason, I thought that it had a more
fundamental effect on things but I think I must've remembered it only in
relation to failed masters rejoining the cluster. Due to MDEV-33465, I
think that my initial thoughts on this are probably wrong and the
default value probably isn't as important as I imagined it would be.

> Now, with the new binlog implementation, there is no longer any AFTER_SYNC.
> The whole point of the feature is to make the binlog commit and the InnoDB
> commit atomic with each other as a whole; there is no point at which a
> transaction is durably committed in the binlog and not committed in InnoDB.
> So the truncation of the binlog at old master restart with
> --rpl-semi-sync-slave-enabled no longer applies.
>
> But I would argue that this binlog truncation is anyway a misfeature. If we
> want to ensure that the master never commits a transaction before it has
> been received by a slave, then send the transaction to the slave and await
> the slave reply _before_ writing it to the binlog. Don't first write it to
> the binlog, and then add complex crash recovery code to try and remove it
> from the binlog again.
>
> And doing the semi-sync handshake _before_ writing the transaction to the
> binlog is something that could be implemented in the new binlog
> implementation. It would be something like BEFORE_WRITE, instead of
> AFTER_SYNC (which does not exist in the new binlog implementation).
>
> Thus, I really want to understand:
>
> 1. Is the --rpl-semi-sync-slave-enabled use case, where a crashing master
> is always demoted to a slave, used by users in practice, to warrant
> implementing something like BEFORE_WRITE semisync for the new binlog
> format?

From what I know and have seen, it is used when something fully automatic
like MaxScale is used to handle failovers and the rejoining of nodes to the
cluster. Without it, I think that you would eventually have to start
restoring the nodes from backups once enough failovers have happened.

I think the bigger problem is that, until MDEV-34878 or something similar
is implemented, there's no way for the crashed master to know what its role
in the cluster is, as it depends on the other nodes in the cluster. If a
failover did take place, then the crashed master must come back as a slave
and try to rejoin the cluster. If no failover took place, the crashed
master must come back as a master and continue accepting writes.

Since --init-rpl-role=MASTER cannot be set at runtime, the safest thing to
do is to live with the consequences and accept the fact that you can't
always rejoin the crashed master back into the cluster.
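
For context, the fully automatic setup I mean is the mariadbmon monitor in
MaxScale with automatic failover and rejoin turned on; a minimal sketch of
the relevant configuration (names and values illustrative):

  [Replication-Monitor]
  type=monitor
  module=mariadbmon
  servers=server1,server2,server3
  user=monitor_user
  password=monitor_pw
  # Promote a slave automatically when the master goes down...
  auto_failover=true
  # ...and reconfigure the old master as a slave when it comes back.
  auto_rejoin=true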

> 2. Is there another reason that AFTER_SYNC is useful that I should know,
> and which needs to be designed into the new binlog format?
>
>  - Kristian.

--
Markus Mäkelä, Senior Software Engineer
MariaDB Corporation
_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org