Jimmy Hú <[email protected]> writes:

>> So which of these two is it for this project?
>
> Perhaps by coincidence, both https://jira.mariadb.org/browse/MDEV-8945
> and https://jira.mariadb.org/browse/MDEV-4698 are due to this
> limitation, but MDEV-8945 complains about network costs while
> MDEV-4698 discusses reliability.
> Though, the more reliable the relay log is, the fewer scenarios in
> which it’d need to redownload from the master.
>
>> So this sounds like we are implementing my point (1) above, and not my point
>> (2)?
>
> Without scanning every GTID from `Gtid_Slave_Pos` onwards, this is the
> only best-effort scenario.
> The state cache would help, but if it’s unavailable, the equivalent
> procedure for the binary log is currently entangled with XA recovery.
> It should only affect old relay logs from server upgrades, though, as
> new logs have that GTID List cheat code.

Sure, old relay logs can have limitations, and there can be other
limitations.

But I think it is really important that you make it _very_ clear whether
this project is about giving guarantees about preserving relay logs, and if
so, exactly what the limitations/exceptions are; or whether it is about
performance/network cost, with no firm guarantees.

>> Thus, it will be a necessary part of this project to explicitly list the full
>> rules for how GTID connect works.
>
>> The scary part here is how to be sure stitching together two separate streams
>> from the master will be done correctly in all cases.
>
>> Do you understand the difference between a GTID position (as in
>> @@gtid_slave_pos) and a GTID state (as in GTID_LIST event)?
>
> A critical discontinuity is that `Gtid_Slave_Pos` – and `Gtid_IO_Pos`
> by extension – is merely a position set.
> But replication by GTID positions – rather than by “states” – cannot
> disambiguate out-of-order domains, let alone invisible changes on the
> other end of the connection.
> If this is a limitation, it's in the fundamentals that this project builds 
> upon.

This isn't very clear.

A GTID _position_ specifies a point in each domain_id event stream. It is
independent of any particular binlog or server.

A GTID _state_ is associated with a specific location in a specific binlog.
It describes the set of GTIDs that are contained in that particular binlog
prior to that location.
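As a concrete illustration (my own sketch, not server code), where a GTID is
the triple (domain_id, server_id, seq_no):

```python
# Illustration only -- not server code. A GTID is (domain_id, server_id, seq_no).

# A GTID *position*: one entry per replication domain, independent of any
# particular binlog or server. E.g. @@gtid_slave_pos = "0-1-100,1-2-50":
position = {
    0: (1, 100),  # domain 0: last applied GTID was 0-1-100
    1: (2, 50),   # domain 1: last applied GTID was 1-2-50
}

# A GTID *state*: tied to a location in a specific binlog. It records the
# last GTID logged for *every* (domain_id, server_id) pair before that
# location, so it can hold several entries per domain:
state = {
    (0, 1): 100,  # domain 0 contains events from server 1 up to seq_no 100
    (0, 3): 90,   # ... and from server 3 up to 90 (a position cannot express this)
    (1, 2): 50,
}
```

The second entry for domain 0 is exactly the kind of information a position
throws away.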

> For now, saving a GTID position in the relay logs’ GTID Lists should suffice.

Why? The whole purpose of GTID_LIST in the binlogs is exactly to specify
the GTID _state_ at that location (not the GTID position)?

Please, it's really critical that a change like this is designed with a
proper understanding of exactly how the GTIDs work. Implementing a
fundamental change like this to how GTID positions are handled, without a
full understanding of the subtleties of the semantics, sounds like a recipe
for disaster.

> Until a smart DBA replicates straight from these relay logs instead of
> setting `--log-slave-updates=ON`.
> (Actually, could the future replace `--log-slave-updates` with
> multi-source binary logs?)
>
>>  - The START SLAVE UNTIL master_gtid_pos is currently handled mostly on the
>>    master side. How will this be handled when continuing existing relay
>>    logs?
>
> What happens if replication is interrupted before reaching this
> `master_gtid_pos`, say, by STOP SLAVE?

STOP SLAVE cancels the UNTIL, which is currently not a problem because the
relay logs are then reset. If only the IO thread is stopped and started, I
think it needs to preserve the UNTIL (I didn't check) - but this nasty
partial-restart logic is exactly what we'd want to fix by proper slave-side
GTID positioning.

But you did not answer how *this* feature/design will handle continuing
relay logs when START SLAVE UNTIL is used?

>>  - What about the semantics for skipping event groups, which IIRC can
>>    sometimes happen on the master?
>
> Many filters use some combination of master-side, IO-side, or SQL-side 
> skipping…
> It’s a good question what their intended rules were.

Yes, a good question, but what is the answer?
Is the design/architecture of this feature still a work-in-progress, and it
will be extended and completed eventually (which is totally fair, but again,
important to clarify)?

> Perhaps it was not clarified enough, but (unlocated) `Gtid_Slave_Pos`
> and disjoint worker positions are both restart positions.

I don't understand this. I don't understand what you mean by "unlocated" and
"disjoint" here. Or with "restart position".

>> It sounds like this is trying too hard to re-use existing (ugly) code in
>> parallel replication related to trying to restart the SQL thread while the
>> IO thread keeps running. That code can hopefully be discarded with this
>> project implemented?
>
> Ideally, yes; but this code on the parallel side is a bit too
> entangled, so this step may overlap with the general merger of serial
> and parallel replication code.
>
>> Modifying the existing code piece by piece will not end up with a clean
>> design, I fear.
>

> I not only agree, but I also fear the existing code was never cleanly
> designed in the first place.

It was quite a long time ago, but specifically regarding the code that tries
to handle restarting one-but-not-the-other of the IO and SQL threads, I think
that was indeed not considered properly in the original design; it was an
oversight. So it surfaced as bugs, which then had to be patched over.

Design bugs are the worst kind of bugs, because they can be not just hard to
debug, but also hard to fix without major code rewrites (and that's of
course why I'm so insistent on details of the design for this project).

> Similarities between master binlog dump threads, slave SQL threads,
> and `mariadb-binlog` are unutilized.
> Even the code for serial and parallel modes is largely disjoint.

> Heck, if I have the opportunity, I’d *demolish* the existing
> structures to better fit Binlog-in-Engine’s reëstablishment of the
> concept of transaction logs.
>
>> I *strongly* disagree with doing anything like this in a stable release,
>> much less old 10.11.
>
> Ack.
> Changes would be less constrained on the `main` branch, too.
>
> ---
>
>> Or do we somehow try to merge the Gtid_Slave_Pos into the current relay log
>> state so we can reconnect to the master at 0-1-150,1-2-100?
>
> Oops, I forgot to include this line in my notebook.
> My current draft includes the `Gtid_Slave_Pos` when restoring the 
> `Gtid_IO_Pos`.
> After all, `Gtid_Slave_Pos` may increment outside of this relay log,
> specifically in other multi-source channels.

What is the precise algorithm for doing this?
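For example, if the intent is a per-domain merge where `Gtid_Slave_Pos` only
overrides domains it is ahead in, that needs to be spelled out. A sketch of
one possible rule (my assumption of what "merge" could mean, not the draft's
actual algorithm), using the 0-1-150,1-2-100 example above:

```python
def merge_positions(io_pos, slave_pos):
    """Merge two GTID positions per domain, keeping the higher seq_no.

    Both arguments map domain_id -> (server_id, seq_no). This is only one
    possible rule; note it silently picks a winner even when the two
    positions name different server_ids in the same domain, which may
    not be safe -- exactly the kind of subtlety the design must pin down.
    """
    merged = dict(io_pos)
    for domain, (server, seq) in slave_pos.items():
        if domain not in merged or merged[domain][1] < seq:
            merged[domain] = (server, seq)
    return merged
```

E.g. merging a relay-log position of 0-1-150 with a `Gtid_Slave_Pos` of
0-1-100,1-2-100 would reconnect at 0-1-150,1-2-100 under this rule.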

>> Or should we perhaps change things so that the skipping happens entirely on
>> the slave side?
>
> This… would not meet either goal; it might even worsen the condition.

Why, what goal would not be met?

It seems to me this would in many ways be a cleaner design: it would avoid
having the logic to determine GTID start and stop positions duplicated on
the master side and slave side.

I do see that it's perhaps too large a change to implement,
re-architecting the whole GTID logic in the dump thread. But the
possibility still needs to be considered and pros/cons understood.

>> One thing not mentioned is some kind of relay log recovery, to make it
>> consistent when restarting after a crash.
>
> Ah, it wasn’t explicitly mentioned, but the process that recovers
> `Gtid_IO_Pos` also counts
> `Master_info::events_queued_since_last_gtid`.
> It occurs during initialization, which is not quite server startup,
> but still before any SQL thread can read the log.

If done here, wouldn't it be much better to truncate the relay log at the
end of the last completed event group? I mean, you'll need to truncate it in
any case if it ends in a partial event, right?

I'd really hope that this project would be able to remove the
`events_queued_since_last_gtid` completely. Wouldn't that be possible? At
least for the SQL thread?
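To sketch what I mean by truncating at the last full event group (my
illustration only, assuming the standard 19-byte v4 event header with the
4-byte event length at offset 9; real recovery must also check checksums and
the format-description event):

```python
import struct

HEADER_LEN = 19  # LOG_EVENT_MINIMAL_HEADER_LEN: timestamp(4), type(1),
                 # server_id(4), event_len(4), next_pos(4), flags(2)

def last_complete_group_end(f, is_group_end):
    """Return the offset just past the last fully-written event group.

    `is_group_end` is a caller-supplied predicate on the event type byte
    (e.g. an XID event ending a transaction); real group detection is
    more involved. Sketch only.
    """
    pos = 4                 # skip the 4-byte binlog magic
    good_end = pos
    f.seek(pos)
    while True:
        header = f.read(HEADER_LEN)
        if len(header) < HEADER_LEN:
            break           # partial header at EOF: stop scanning
        event_len = struct.unpack_from('<I', header, 9)[0]
        body = f.read(event_len - HEADER_LEN)
        if len(body) < event_len - HEADER_LEN:
            break           # partial event at EOF: stop scanning
        pos += event_len
        if is_group_end(header[4]):
            good_end = pos  # safe truncation point
    return good_end
```

The caller would then truncate the relay log at the returned offset, which
handles a trailing partial event and a trailing partial group uniformly.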

>> It could contain partial event groups or even events.
>
> The project currently relies on `@@relay_log_recovery` in this
> situation, whether it was manually set or because `Gtid_IO_Pos`
> recovery hit a corruption.

> Events are flushed in wholes, FWIW.

So does this mean that we will rely on the file system to never end up with
a partial event at the end of the relay log?
Have you researched whether that is a guarantee provided by POSIX/Linux/specific
file systems?

> (Wait, the SQL thread detects EOF as part of a
> `LOG_EVENT_MINIMAL_HEADER_LEN`-sized read, but if unfinished writes
> could look like half-events of arbitrary lengths, won’t the SQL thread
> be fooled about the relay log’s status?)
>
>> Maybe we can just truncate it to the last full event
>> group?
>
> In fact, the documentation of `@@relay_log_recovery`’s implementation,
> `init_recovery()`, says:
>
>> In the feature, we should improve this routine in order to avoid throwing
>> away logs that are safely stored in the disk. Note also that this recovery
>> routine relies on the correctness of the relay-log.info and only tolerates
>> coordinate problems in master.info.
>
> So this would be a distinct sub-improvement that benefits non-GTID mode as 
> well.
>
>> Or maybe start from the end of the last full event group in the relay log,
>> and compare events received from the master after that with the events in
>> any following partial event group; and complain if they differ. This would
>> be a more robust protection against getting half an event group from one
>> master and combine it with another half of a different event group from
>> another master;
>
> Hm, indeed.
> But is the current guard, which only checks the GTID, sufficient,
> given that the ‘G’ in “GTID” expects the two masters to send identical
> groups for the same ‘TID’?

Unfortunately this is tricky.

We cannot expect in the general case to get identical event groups for the
same GTID. Because the events are re-generated during slave apply, they are
not copied from the original master events. The event groups will be
equivalent, in that they produce the same changes, but they are not in
general binary-identical.

In principle we cannot even be sure to have the same *number* of events in
the event groups received from two different masters. Because replicating an
event group can convert from statement to row form, for example.

And we cannot even be sure to _know_ if reconnect reaches the same or a new
master! Because what if the network is set up with round-robin DNS, for
example? Probably we could rely on checking server_id though.

All of this is probably a bit too much. IIRC the current code has some
logic that tries to deal with this, but not completely; there should be some
comments about it.

But all of this mess is one reason why I'd _really_ like to see the whole
mess with `events_queued_since_last_gtid` go away, and just use GTID and
consider only positions at the start of event groups.

>> I don't understand this, as I thought the Gtid_IO_Pos is a property of the
>> relaylog (just like @@gtid_binlog_pos is a property of the binlog), not
>> something that can be meaningfully configured?
>
> The thought was that the IO thread now connects using a maintained
> `Gtid_IO_Pos` rather than restarting from `Gtid_Slave_Pos`.
> So, configuring `Gtid_IO_Pos` allows fast-forwarding or rewinding the
> download position independent of the SQL thread(s) (`Gtid_Slave_Pos`).

Is this really a good idea? What would be the use of this?

>> In any case, I think it is reasonable to add changing the current GTID
>> position as one case that causes re-initialization of the relay logs, and it
>> will avoid one set of tricky problems.
>
> Indeed, an alternative is changing `Gtid_IO_Pos` when `Gtid_Slave_Pos`
> changes (by requiring redownloads of **all** relay logs).
>
>> Maybe it's fine, we'd have to assume
>> the backup preserves both mysql.gtid_slave_pos table and relay log, not sure.
>
> Or the restored server must re-replicate from the restored
> `mysql.gtid_slave_pos` one way or another.

 - Kristian.
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]
