Thanks a lot, Markus, for the additional explanations; they are very useful.

Markus Mäkelä via developers <developers@lists.mariadb.org> writes:

> From what I remember (in relation to MDEV-33465) having the master
> roll back the transactions caused some problems to happen if a quick
> restart happened. I think it was that if GTID 0-1-123 gets replicated
> due to AFTER_SYNC but then the master crashes and comes back up, it
> rolls back 0-1-123 due to --rpl-semi-sync-slave-enabled (or
> --init-rpl-role=SLAVE after MDEV-33465) but before replication starts
> back up, another transaction gets committed as GTID 0-1-123 on the
> master. Now when the replication asks for "GTID position after

Yes. _Either_ we need to be sure the master is ahead of all slaves, and we
keep it as the master after crash-recovery. _Or_ we need to be sure at least
one slave is ahead of the master, and we promote that slave as the new
master and demote the old master to a slave after crash recovery. Otherwise
the replication hierarchy cannot be reliably re-assembled after a master
crash.
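
To make the either/or condition concrete, here is a minimal sketch in Python.
It is a toy model of my own, not server code: each binlog is just a list of
(GTID, payload) pairs, and the names is_prefix and can_reassemble are
hypothetical.

```python
from typing import List, Tuple

# Toy model (assumption): one binlog event is a (GTID, payload) pair.
Event = Tuple[str, str]


def is_prefix(a: List[Event], b: List[Event]) -> bool:
    """True if history a is a prefix of history b, i.e. b has all of a."""
    return len(a) <= len(b) and b[:len(a)] == a


def can_reassemble(master: List[Event], slaves: List[List[Event]]) -> str:
    """Classify the state of the topology after a master crash."""
    # Case 1: the master has everything every slave has: keep it as master.
    if all(is_prefix(s, master) for s in slaves):
        return "keep-master"
    # Case 2: at least one slave has everything the master has: promote it.
    if any(is_prefix(master, s) for s in slaves):
        return "promote-slave"
    # Neither holds: the same GTID names different transactions somewhere.
    return "diverged"
```

The scenario Markus describes is the third case: the slave holds 0-1-123 with
one payload, the restarted master rolled it back and then committed a
different transaction under the same GTID, so neither history is a prefix of
the other.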

> From what I know and have seen, it is used when something fully
> automatic like MaxScale is used to handle failovers and rejoining of
> nodes to the cluster. Without it, I think that you would eventually
> have to start restoring the nodes from backups once enough failovers
> have happened.

What about the following idea?

1. Implement BEFORE_WRITE semi-sync mode. The master will not write
   transactions to the binlog until at least one slave has acknowledged.

2. This means that if the master crashes, when it comes back up it will have
   no transaction that does not exist on at least one running node
   (assuming at most a single failure at a time).

3. When the master restarts, it will go into read-only mode and wait for
   MaxScale (or other management system) to tell it what to do, similar to
   MDEV-34878.

4. If MaxScale decides to keep it as the master, it will briefly set it up
   as a slave and make sure it has replicated the latest GTID on any slave
   in the replication topology. Then it will be set read-write and continue
   as the master.

5. If MaxScale decides to promote another server as the new master, the old
   master is kept in read-only mode and configured as a slave. The
   BEFORE_WRITE ensures the old master will not be ahead of the new master.

This requires the ability in MaxScale to do (4).
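
Steps 3 to 5 could be sketched roughly as follows in Python. This is only an
illustration under stated assumptions: BEFORE_WRITE does not exist yet, the
Server class and the catch_up and recover functions are hypothetical names,
binlogs are modelled as plain lists of GTIDs, and at least one other server
is assumed to be running.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    binlog: List[str] = field(default_factory=list)  # GTIDs in commit order
    read_only: bool = True   # step 3: every server starts up read-only
    role: str = "unknown"    # decided later by the manager


def catch_up(target: Server, source: Server) -> None:
    """Replicate to target whatever source has beyond target's position."""
    # BEFORE_WRITE assumption: target is never ahead of source and never
    # diverged from it, so target's binlog must be a prefix of source's.
    if source.binlog[:len(target.binlog)] != target.binlog:
        raise RuntimeError("histories diverged; BEFORE_WRITE was violated")
    target.binlog.extend(source.binlog[len(target.binlog):])


def recover(restarted: Server, others: List[Server],
            keep_as_master: bool) -> Server:
    """Manager-side decision after a master restart; returns the master."""
    most_advanced = max(others, key=lambda s: len(s.binlog))
    if keep_as_master:
        # Step 4: briefly act as a slave, fetch the latest GTIDs, then
        # go read-write and continue as the master.
        catch_up(restarted, most_advanced)
        restarted.role, restarted.read_only = "master", False
        for s in others:
            s.role = "slave"
        return restarted
    # Step 5: promote the most advanced slave; the old master stays
    # read-only and becomes a slave of the new master.
    most_advanced.role, most_advanced.read_only = "master", False
    for s in others:
        if s is not most_advanced:
            s.role = "slave"
    restarted.role = "slave"
    catch_up(restarted, most_advanced)
    return most_advanced
```

The point of the sketch is that catch_up never needs to remove anything from
the restarted server's binlog; it only appends, which is what BEFORE_WRITE is
supposed to guarantee.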

I think this will be much more robust than having a crashed server try to
remove transactions already written to the binlog, and having to configure
the server to have one or another role when it starts up.

Instead, every server in the replication topology always waits at startup
for the manager to have it replicate any missing transactions from the
appropriate server. The manager then either sets it read-write as a master
or lets it continue as a slave.

What do you think? Of course, this is all for the future; it requires
implementing BEFORE_WRITE in the server first. But I think it sounds
promising.

> I think that implementing semi-sync in each application is probably a
> bit too much but doing it in a proxy like MaxScale does sound doable
> and the implementation would be essentially the same: delay the OK for

It sounds like the new binlog-in-engine should support semi-sync (perhaps
not in the first release, but eventually). It could then support
AFTER_COMMIT, which would be used when a crashed server is allowed to
restart and continue by itself, as is the current default. And then also
support BEFORE_WRITE, where transactions are sent to the slave before being
written to the binlog, and a crashed server comes up in read-only mode after
restart.

MaxScale could still implement its own version, but it is probably best if
the new binlog implementation also supports some form of semi-sync
eventually.

> I had a vague memory of the group commit mechanism doing only one ACK
> per group but I might have remembered it wrong, I'm mostly a passive

I think it still does one ACK for every commit, but this could be improved
in the server (MDEV-33491).

 - Kristian.