Hi,
On 12/4/24 18:08, Kristian Nielsen wrote:
> Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
>> On 12/4/24 13:19, Kristian Nielsen via developers wrote:
>>> 5. A more controversial thought is to drop support for semi-sync
>>> replication. I think many users use semi-sync believing it does something
>> As a (kind of) user of semi-sync replication, I believe it has a
>
> Hi Markus, thanks for taking the time to comment! Your input is very
> valuable.
>
>> valid, albeit limited, use-case and that it's a necessary component in
>> setups where no transactions are allowed to be lost when the primary
>> node in a replication cluster goes down. Perhaps I'm wrong or the way
>
> I would like to be explicit about what "no transactions are allowed to be
> lost" means. I know you, Markus, fully understand what it means, of course.
> Transactions can easily be lost if the server crashes up to and during the
> commit. What it really means is that the server will send a notification to
> the client at some point when a single point of failure will no longer cause
> the transaction to be lost. With semi-sync, this notification comes in the
> form of the "ok" result of the client's commit.
>
> I want to understand if there are other, possibly better ways to get this
> notification, if that is all the relevant applications need?
>
> I was suggesting that the application could itself use MASTER_GTID_WAIT()
> against a slave before accepting the commit as "ok" (or a proxy like
> MaxScale could do it for the application). Does the current semi-sync
> replication do anything more for the application than this, and if so, what?

I think that implementing semi-sync in each application is probably a bit
too much, but doing it in a proxy like MaxScale does sound doable, and the
implementation would be essentially the same: delay the OK for the commit
until at least one replica responds to the MASTER_GTID_WAIT(). The number
of roundtrips should be the same, so the only downside of this approach is
that you're forced to wait for the SQL thread to apply the transaction,
which introduces more latency than the existing semi-sync approach does. If
a function like MASTER_GTID_WAIT_FOR_IO_THREAD() were to exist, it would
probably be very close in terms of latency.
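
For illustration, a minimal sketch of the pattern I mean, with a made-up
GTID; MASTER_GTID_WAIT() returns 0 once the replica has applied the GTID
and -1 on timeout:

  -- On the master, run by the proxy after the client's COMMIT succeeds:
  SELECT @@gtid_binlog_pos;  -- e.g. "0-1-123"

  -- On a replica, before the proxy forwards the OK to the client:
  SELECT MASTER_GTID_WAIT('0-1-123', 10);  -- wait up to 10 seconds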

>> Another use-case that I think I heard about was to use semi-sync
>> replication to slow down the rate of writes so that replication lag is
>> avoided. While this is possible, I believe that tuning the group commit
>> size to be larger probably has the same effect with better overall
>> performance.
>
> One benefit of this method is that each commit can decide whether it needs
> to wait or not. One commit that "is not allowed to be lost" will not block
> other transactions from committing. I think with AFTER_SYNC, all following
> transactions will be blocked from committing until the current commit has
> been acknowledged by a slave, and that with AFTER_COMMIT they will not be
> blocked, but I'm not 100% sure.

I had a vague memory of the group commit mechanism doing only one ACK per
group but I might have remembered it wrong; I'm mostly a passive observer
to the replication-related discussions in Zulip and MDEVs. If it indeed
does one ACK per commit even when there's a group of transactions, then
doing it at the application level might perform better, as the waits could
be done in parallel.
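
For reference, the group commit tuning I mean is done with the binlog group
commit variables; a sketch with made-up values, not a recommendation:

  -- Delay the binlog fsync for up to 100ms, or until at least 20
  -- transactions have queued up, so more commits share one group commit.
  SET GLOBAL binlog_commit_wait_count = 20;
  SET GLOBAL binlog_commit_wait_usec = 100000;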

>> misunderstanding comes from this. The default value of
>> rpl_semi_sync_master_wait_point should be AFTER_SYNC (lossless
>> failover) and rpl_semi_sync_master_timeout should be set to something
>
> I would like to understand the reason(s) AFTER_SYNC is better than
> AFTER_COMMIT.
>
> From my understanding, from the client's narrow perspective about their own
> commit there is little difference: either is a notification that the
> transaction is now robust to a single point of failure (available on at
> least two servers).

Yes, I think you're right and from the point of view of the client the
configuration is irrelevant: if you get the OK for the commit the
transaction is "durable" on more than one server.

> I know of one use-case, which is when things are set up so that if the
> master crashes, failover to a slave is _always_ done, and the crashed
> master is changed to be a slave of the new master (as opposed to letting
> the master restart, do crash recovery, and continue its operation as a
> master).
>
> With AFTER_COMMIT, the old master might have a transaction committed that
> does not exist on the new master, which will prevent it from working as a
> slave, and it will need to be discarded (possibly restored from a backup).
>
> With AFTER_SYNC, the old master may still (after restarting) have a
> transaction committed to the binlog that is not on the slave / new master.
> But the old master can be restarted with --rpl-semi-sync-slave-enabled,
> which tries to truncate the binlog to discard as many transactions from it
> as possible, to make sure it only has transactions that are also present
> on the new master.

I think this is the use-case that MDEV-21117 and MDEV-33465 relate to. From
what I remember (in relation to MDEV-33465), having the master roll back
the transactions caused some problems if a quick restart happened. I think
it was that if GTID 0-1-123 gets replicated due to AFTER_SYNC but then the
master crashes and comes back up, it rolls back 0-1-123 due to
--rpl-semi-sync-slave-enabled (or --init-rpl-role=SLAVE after MDEV-33465),
but before replication starts back up, another transaction gets committed
as GTID 0-1-123 on the master. Now when the replication asks for "GTID
position after 0-1-123", instead of getting an "I have not seen that GTID"
error, the replication continues and history has effectively been
rewritten. I don't remember if this was the exact problem but it was
something along these lines. Looking at the description of --init-rpl-role
(https://mariadb.com/kb/en/mariadbd-options/#-init-rpl-role), it seems that
it can also cause replication to break.
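
To make the sequence concrete, here is my rough understanding of it as
made-up statements (whether the restarted master accepts the write at all
depends on the setup, e.g. read_only):

  -- Old master restarts and truncates the replicated-but-unacknowledged
  -- transaction, so its binlog state goes back one sequence number:
  SELECT @@gtid_binlog_pos;   -- "0-1-122" (0-1-123 was truncated)

  -- A local write on the old master now reuses the sequence number:
  INSERT INTO t1 VALUES (1);  -- binlogged as a different 0-1-123

  -- A server that already applied the original 0-1-123 reconnects:
  CHANGE MASTER TO master_use_gtid = slave_pos;
  START SLAVE;
  -- Replication resumes from "after 0-1-123" without an error, silently
  -- skipping the new, different 0-1-123: diverged history.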

> (Interestingly, this means that the purpose of AFTER_SYNC is to ensure that
> transactions _are_ lost, rather than ensure that they are _not_ lost).
>
> Is this the (only) reason that AFTER_SYNC should be default? Or do you know
> of other reasons to prefer it?

I think this is the only reason, I thought that it had a more
fundamental effect on things but I think I must've remembered it only in
relation to failed masters rejoining the cluster. Due to MDEV-33465, I
think that my initial thoughts on this are probably wrong and the
default value probably isn't as important as I imagined it would be.

> Now, with the new binlog implementation, there is no longer any AFTER_SYNC.
> The whole point of the feature is to make the binlog commit and the InnoDB
> commit atomic with each other as a whole; there is no point at which a
> transaction is durably committed in the binlog and not committed in InnoDB.
> So the truncation of the binlog at old master restart with
> --rpl-semi-sync-slave-enabled no longer applies.
>
> But I would argue that this binlog truncation is anyway a misfeature. If we
> want to ensure that the master never commits a transaction before it has
> been received by a slave, then send the transaction to the slave and await
> the slave reply _before_ writing it to the binlog. Don't first write it to
> the binlog, and then add complex crash recovery code to try and remove it
> from the binlog again.
>
> And doing the semi-sync handshake _before_ writing the transaction to the
> binlog is something that could be implemented in the new binlog
> implementation. It would be something like BEFORE_WRITE, instead of
> AFTER_SYNC (which does not exist in the new binlog implementation).
>
> Thus, I really want to understand:
>
> 1. Is the --rpl-semi-sync-slave-enabled use case, where a crashing master
> is always demoted to a slave, used by users in practice, to warrant
> implementing something like BEFORE_WRITE semisync for the new binlog
> format?

From what I know and have seen, it is used when something fully automatic
like MaxScale is used to handle failovers and the rejoining of nodes to the
cluster. Without it, I think that you would eventually have to start
restoring the nodes from backups once enough failovers have happened.

I think the bigger problem is that, until MDEV-34878 or something similar
is implemented, there's no way for the crashed master to know what its role
in the cluster is, as it depends on the other nodes in the cluster. If a
failover did take place, then the crashed master must come back as a slave
and try to rejoin the cluster. If no failover took place, the crashed
master must come back as a master and continue accepting writes.

Since --init-rpl-role=MASTER cannot be set at runtime, the safest thing to
do is to live with the consequences and accept the fact that you can't
always rejoin the crashed master back into the cluster.
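
For context, the fully automatic setup I mean is the mariadbmon monitor in
MaxScale with automatic failover and rejoin turned on; a minimal sketch of
the relevant configuration (names and values illustrative):

  [Replication-Monitor]
  type=monitor
  module=mariadbmon
  servers=server1,server2,server3
  user=monitor_user
  password=monitor_pw
  # Promote a slave automatically when the master goes down...
  auto_failover=true
  # ...and reconfigure the old master as a slave when it comes back.
  auto_rejoin=true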

> 2. Is there another reason that AFTER_SYNC is useful that I should know,
> and which needs to be designed into the new binlog format?
>
>  - Kristian.

--
Markus Mäkelä, Senior Software Engineer
MariaDB Corporation
_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org