Jimmy Hú <[email protected]> writes:

> As you requested, I have written a design description for my MDEV-4698
> <https://jira.mariadb.org/browse/MDEV-4698>/MDEV-33645
> <https://jira.mariadb.org/browse/MDEV-33645> project so far.

> Please let me know your thoughts, especially on how well this plan fits in
> the big picture.

Thanks, Jimmy, for the design writeup. I do appreciate that you've spent
effort to understand the many issues around this project.

The design writeup is closely tied to the details of the existing
replication code and how it should be modified. I think starting from the
code is a good approach; it is also how I do large projects, since it is
necessary to start somewhere. And then at some point, it is necessary to take a step
back and consider the design and architecture from a higher-level point of
view. Describe the problem and the various possible solutions independent of
the current code, discuss pros and cons of each, and explain the reasons for
choosing a particular one.

I read the description several times, and I found it hard to understand the
high-level picture from it. This is not personal criticism; on the contrary,
I know very well how hard it is to understand the replication code and how
to extend it with new development, and it seems a lot of good progress has
been made here. I am just trying to explain the overall reasoning behind my
comments here. I think the correct design and understanding of all the
details is perhaps 80% of the work in this task.

First, what is the overall goal of this project? I can think of two
possibilities:

1. Performance optimisation. When the slave restarts, avoid re-downloading
binlog data from the master, saving master load and network bandwidth. If
this is the goal, then preserving relay logs can be best-effort, and any
corner cases where this might be unsafe or inconvenient can always fall back
to discarding existing relay logs and re-downloading from the master starting
at Gtid_Slave_Pos.

2. Safe copy of master binlogs. So that once a transaction is in the relay
log on a slave, it has some kind of guarantee to be available on that slave
even if the binlog on the master disappears, or the master itself does. This
requires clear and well-documented rules for exactly when relay logs will be
preserved and when they will be deleted. So ad-hoc fallbacks in corner
cases will no longer be an easy way to solve tricky details.

So which of these two is it for this project?

Next, the overall approach for locating the GTID start position.

3. In the "start from scratch" case with @@relay_log_recovery and the like, I
assume that the existing behaviour is to be preserved 100%: the exact same
rules will be used for how a GTID start position will be handled in all the
different possible scenarios. Only internally might there
be some re-factoring to share code, but the logic will be unchanged. Is this
correctly understood?

I think this is important to make clear. Because the main requirement for
a change like this will be that the behaviour must be 110% the same between
the cases where the relay log is deleted and re-downloaded from scratch, and
the case where slave continues from the existing relay logs.

Thus, it will be a necessary part of this project to explicitly list the full
rules for how GTID connect works. This is not trivial; there are complex
rules and a number of different cases. For example:

 - The rules for correctly handling out-of-order sequence numbers between
   different server IDs.

 - There are special rules for when a domain that is not found in the
   master's binlog results in an error, and when it results in that domain
   being ignored.

 - There are subtly different semantics between gtid_strict_mode 0 and 1.

 - The START SLAVE UNTIL master_gtid_pos is currently handled mostly on the
   master side. How will this be handled when continuing existing relay
   logs?

 - What about the semantics for skipping event groups, which IIRC can
   sometimes happen on the master?

 - ... etc. - what is the full set of rules?

I think the confidence in the correctness of this change will to a large
extent be determined by the quality of the description of what these rules
are, and how the design will ensure that the rules will be the same in the
new mode where existing relay logs are re-used.

It is clear that removing this limitation of deleting relay logs in GTID
mode would be a very nice improvement. It will also be nice to get rid of
the tricky special cases in the code that try to partially handle the issue
when only one of the IO or SQL threads is stopped and restarted; this code
is complex and error-prone. The challenge is to do this correctly.


4. The IO thread. From the description, my understanding is that the IO
thread will now have two cases:

(a) If no relay logs, connect to the master at GTID position Gtid_Slave_Pos,
as before.

(b) If continuing from existing relay logs, the IO thread will connect from
the position in the existing relay logs that corresponds to
@@gtid_binlog_pos of the binlogs.

The scary part here is how to be sure that stitching together two separate
streams from the master will be done correctly in all cases. Some things
make this harder: things can change from outside the server without the
slave code being aware. For example, after a mariadbd restart the configuration may
have changed. If the changes affect the contents of the relay log in any
way, it could result in the change taking effect at a random point in the
replication stream, which must be avoided. Or what if the DNS changes and
the master server is now different? Or if the master.info was modified
manually?

All the possible issues around such cases must be understood and documented,
regardless of whether they are to be handled automatically or documented as
limitations. The complexities around such issues are the major reason for the
original limitation of deleting the relay logs; it is hard for the
replication code to be robust to all the possible changes that can happen
between server stop and server start.


5. The SQL thread. What is the overall plan for how to start the SQL thread
(when using GTID)? It was not clear to me from the description.

To me, the main idea of a project like this should be to implement the
logic for how to start in the relay log at the current Gtid_Slave_Pos on the
slave side, in the SQL driver thread. This mechanism only exists on the
master side in the current code.

And with this in place, can't the SQL driver thread simply start reading
events from the relay log, skipping any that have not yet reached their
start GTID, or that _have_ reached their UNTIL position?

And then there should be no more need for any code to differ between
parallel and non-parallel replication, nor any special handling for skipping
events that have not yet reached their position? Because all the logic can
be in the shared code, prior to calling rpl_parallel::do_event() or
apply_event_and_update_pos()?

Is that what you are planning to do, or is there some reason this is not
possible?
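To make the idea concrete, here is a rough sketch (plain Python, hypothetical names, nothing like the actual server code) of what such shared skip logic in the SQL driver thread might look like, deciding per event group whether it lies before the starting position or past an UNTIL position:

```python
# Hypothetical sketch: skip decision placed in the shared SQL driver thread
# path, before dispatching to either rpl_parallel::do_event() or
# apply_event_and_update_pos(). Not MariaDB's actual API.

def should_skip_event_group(gtid, gtid_slave_pos, until_pos=None):
    """Decide whether an event group read from the relay log should be
    skipped.

    gtid           -- (domain_id, server_id, seq_no) of the group being read
    gtid_slave_pos -- dict mapping domain_id -> last applied seq_no
    until_pos      -- optional dict mapping domain_id -> stop-at seq_no
    """
    domain, _server_id, seq_no = gtid
    # Skip anything already applied in this domain.
    if domain in gtid_slave_pos and seq_no <= gtid_slave_pos[domain]:
        return True
    # Skip anything past the UNTIL position for this domain.
    if until_pos is not None and domain in until_pos and seq_no > until_pos[domain]:
        return True
    return False
```

Both the parallel and the serial path would then call this once per group in the common driver loop; note it deliberately ignores the out-of-order-server-id rules listed above, which a real implementation would have to fold in.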


Now, a more specific question.
One of the issues to solve is what happens after a crash, where part of the
(end of the) relay log may be lost, or even after a restart where it is
intact.

How can the IO thread determine where to continue? The start of the relay
log was from some Gtid_Slave_Pos, where the master filters out events that
are prior to each domain's starting point. But after a restart, this original
starting GTID position is no longer known. And the current relay log may
have reached the starting position in one domain but not in the other.

I.e. suppose we start a new relay log from 0-1-100,1-2-100. We receive up to
binlog state 0-1-150,1-2-50. Then we restart the slave. Do we reconnect to
the master at the current relay log position of 0-1-150,1-2-50 ? Then from
this point on, we will have extra events in the relay log that would not
have been there without the restart, and the SQL thread will now have the
task of skipping them based on Gtid_Slave_Pos.

Or do we somehow try to merge the Gtid_Slave_Pos into the current relay log
state so we can reconnect to the master at 0-1-150,1-2-100?

Or should we perhaps change things so that the skipping happens entirely on
the slave side? So that now the initial slave connect will have the master's
dump thread send _all_ events starting from somewhere at or before the event
0-1-100 (say 0-1-90 for example), and all the extra events will be skipped
by the SQL thread? This way the logic would be in one place and would be the
same regardless of slave reconnects or restarts.
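The second option above, merging Gtid_Slave_Pos into the current relay log state, amounts to taking the per-domain maximum. A toy sketch (hypothetical Python, ignoring the out-of-order-server-id subtleties mentioned earlier):

```python
# Hypothetical sketch: compute the IO thread's reconnect position by taking,
# per replication domain, whichever of the relay log position and
# Gtid_Slave_Pos is further ahead.

def merge_reconnect_pos(relay_log_pos, gtid_slave_pos):
    """Both arguments map domain_id -> (server_id, seq_no)."""
    merged = dict(relay_log_pos)
    for domain, (server_id, seq_no) in gtid_slave_pos.items():
        # Take the GTID with the higher seq_no in each domain.
        if domain not in merged or merged[domain][1] < seq_no:
            merged[domain] = (server_id, seq_no)
    return merged

# The example from the text: relay log reached 0-1-150,1-2-50 while the
# original start position was 0-1-100,1-2-100.
relay = {0: (1, 150), 1: (2, 50)}
start = {0: (1, 100), 1: (2, 100)}
print(merge_reconnect_pos(relay, start))  # {0: (1, 150), 1: (2, 100)}
```

That reproduces the 0-1-150,1-2-100 reconnect point from the example; a plain per-domain max is of course an oversimplification wherever out-of-order sequence numbers between server IDs are allowed.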


Next, on to detailed comments:

>     Fixes MDEV-4698
>         Resolves MDEV-33645
> 
> Note
> 
> Gtid_Slave_Pos in this description refers to @@gtid_slave_pos in
> master_use_gtid=slave_pos and @@gtid_current_pos in
> master_use_gtid=current_pos.
> 
> In GTID replication, START SLAVE deletes the relay log if neither IO nor
> SQL threads are running. While this was an intentional design for crash
> compatibility, it places an unnecessary requirement to redownload the
> relay log from the master.
> 
> The goal of this project is to lift this limitation for all cases (except
> explicit crash recovery, @@relay_log_recovery).

> It is doable on 10.11 with no major refactorings, though I’m open to
> rebasing to the main branch if we deem these changes in behaviour or code
> too impactful.

I *strongly* disagree with doing anything like this in a stable release,
much less old 10.11.

> @@relay_log_recovery, RESET SLAVE, and CHANGE MASTER (which covers
> changing MASTER_USE_GTID) continue to delete the relay log and rewind
> Gtid_IO_Pos to Gtid_Slave_Pos.

Sounds reasonable.

> The majority of the project, however, is filling in the gaps in position
> recovery mechanisms for both the IO and SQL threads. They have not only
> been neglected, but have also relied on this very limitation. Once the
> threads restore their states, they can continue from where they left off
> as if they paused normally.

Right.
However, as I tried to explain above, the project needs to design a proper
mechanism for continuing from where they left off and implement that.
Modifying the existing code piece by piece will not end up with a clean
design, I fear.

> IO Thread
> 
> The IO Thread sends Gtid_IO_Pos to the master, which then omits event
> groups before this position for the slave. Unlike Gtid_Slave_Pos, however,
> this position is not persisted. As persistence (in a .state file?) would
> require trade-offs between performance and crash safety during operation,
> reloading it by scanning the relay log is more preferable, at least as a
> capability.
> 
> Notably, @@gtid_binlog_pos already has this procedure for when the
> log-bin.state file is unusable.
> 
> The (initialization of) relay logs can leverage it – minus the TC-specific
> steps – to restore the GTID position at the appending end, including the
> event count in a partial most-recent group,
> Master_info::events_queued_since_last_gtid.
> 
> Though a missing puzzle piece for scanning the relay log is that, unless
> the startup scans every GTID between Gtid_Slave_Pos and the appending end,
> the scan would need a position list event to be certain about all domains
> with this portion.
> 
> Relay logs’ GTID Lists currently mostly only come from the master, however.
> 
> Side Effect: For convenience, I enabled header GTID Lists in relay logs to
> match the master binary logs.
> 
> This guaranteed position list enables the procedure to only scan the last
> log file for both binary and relay logs.

Agree that having guaranteed GTID list in the relay log files sounds right.
Do you understand the difference between a GTID position (as in
@@gtid_slave_pos) and a GTID state (as in GTID_LIST event)? (I assume you
do, but want to confirm as this point is so essential.)

The GTID state for the GTID_LIST event should be easy to maintain, it's just
updating the starting GTID state with each GTID as it is seen. It requires
that the master sends the starting GTID state on slave connect (I recall
that it sends a fake GTID_LIST, but don't recall if this is a GTID state or
just a GTID position). I'm not 100% sure of the implications when events are
skipped master-side; that may be a concern, since it will make the
master-side/binlog and slave-side/relay-log GTID states out of sync.
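For illustration of the distinction (hypothetical Python, not the server's internal representation): a GTID state keeps the last seq_no per (domain_id, server_id) pair, while a GTID position keeps only one GTID per domain, and maintaining the state is indeed just one update per observed GTID:

```python
def update_gtid_state(state, gtid):
    """Update a GTID state with one observed GTID.

    A GTID *state* maps (domain_id, server_id) -> last seen seq_no, unlike a
    GTID *position*, which keeps only the last GTID per domain_id."""
    domain, server_id, seq_no = gtid
    state[(domain, server_id)] = seq_no
    return state

state = {}
for g in [(0, 1, 100), (0, 2, 90), (0, 1, 101)]:
    update_gtid_state(state, g)
print(state)  # {(0, 1): 101, (0, 2): 90}
# The corresponding GTID *position* for domain 0 would be just the last GTID
# logged in that domain: 0-1-101.
```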

And hopefully the SQL driver thread would then _always_ use the GTID search
procedure for where to start (in GTID mode). Indeed, it sounds like this
will require having GTID indexes on the relay logs. But I think the code in
gtid_index.cc should be directly usable for this, needs to be checked
though.

> The scanning procedure will fail out of caution if the relay log doesn’t
> have a GTID List anywhere in this file, and the position reload falls back
> to @@relay_log_recovery to perform “the current behaviour” one last time.

So this sounds like we are implementing my point (1) above, and not my point
(2)?

> Room for Improvement: This header GTID List and the ability to restore
> GTID position bring the Relay Log closer to the master-side Binary Log, so
> it’d be worth consolidating their (and possibly also mariadb-binlog’s)
> unique but equivalent code, such as @@gtid_binlog_pos & Gtid_IO_Pos, and
> the position-aware event reading portions of dump thread and SQL thread’s
> loops.


> SQL Thread
> 
> As the GTID system was built on top of the file-position system, the SQL
> thread still depends on relay_log_file/pos even in GTID mode, but for
> performance, it does not save this file-position to @@relay_log_info_file
> every transaction (since START SLAVE will delete the relay log anyways).
> 
> This dependence is the opposite of MDEV-37584’s goal of leveraging
> Gtid_Slave_Pos so we don’t have to keep @@relay_log_info_file
> synchronized.
> 
> In addition, since turning parallel replication off no longer deletes the
> relay log (by START SLAVE), serial START SLAVE SQL_THREAD now also needs
> to prepare for (multi-domain) parallel replication stopped with worker
> threads at different positions, with some possibly earlier than the one
> that goes into @@relay_log_info_file.

Indeed. Since the slave server could be stopped, --slave-parallel-threads
changed to 0, and restarted, the slave would have changed from parallel to
non-parallel replication without any way for the code to know the prior
state.

But won't this be handled uniformly by both modes, when the skipping of
events before each event's starting position happens in the common part of
the SQL driver thread, in exec_relay_log_event() maybe?

> All of these mean that, until parallel and serial modes merge their code,
> I must port parallel replication’s ability to skip already-executed groups
> as if turning @@gtid_ignore_duplicates on for this duration.
> 
> (Non-group events such as Format Description and Rotate cannot be skipped.)
> 
> After multiple iterations that cause side effects, I settled on turning
> parallel replication’s Relay_log_info::gtid_skip_flag into yet another
> skip-event flag. This flag is then updated and checked in serial-specific
> code only to avoid conflicting with parallel replication (workers). (Here,
> it’s updated at GTID events only, but it shouldn’t hurt to update at group
> endings as well, which (theoretically) can utilize
> Relay_log_info::is_in_group().)

I don't really follow this explanation.
It sounds like this is trying too hard to re-use existing (ugly) code in
parallel replication related to trying to restart the SQL thread while the
IO thread keeps running. That code can hopefully be discarded with this
project implemented?

> In multi-domain replication, the SQL thread must first rewind its position
> to return to the workers’ positions during its natural flow, in case of
> parallel workers stopped at different positions.
> 
> Serial replication also has this requirement, in case the processing was
> previously in parallel.
> 
> Side Effect & Limitation: Without any indexing on the relay log (Do we
> want this runtime tax?), the SQL thread has to rewind to the beginning of
> the (available) relay log.
> 
> Room for Improvement: We can make use of the newly guaranteed header GTID
> Lists to scan backward or binary search instead, especially if
> @@relay_log_purge=OFF, but this capability is better as a MYSQL_BIN_LOG
> function extracted from the binlog dump thread’s main loop.
> 
> This procedure exists in STOP SLAVE for parallel replication; I moved it
> to START SLAVE for both parallel and serial replication, so not only will
> this preparation account for any changes in Gtid_Slave_Pos, but server
> reboots will also perform this.
> 
> This section also includes starting with the skip flag set in case of a
> relay log switch mid-group. (Can it?) New open issues around
> Gtid_Slave_Pos changes


> SQL Thread
> 
> As of this draft, changing Gtid_Slave_Pos does not invalidate the relay
> logs for all connections, and the SQL threads will directly resume from
> the new position(s) in the existing relay log.

This doesn't sound safe?

What if the position gets set backwards? Then the starting position may be
before the start of the relay log in all or only some domains.

I think it would be reasonable to delete the relay logs when the
@@gtid_slave_pos is changed.

But what if the server is restored from a backup (which changes the
@@gtid_slave_pos without the server code being aware of it)? Can we be sure
that this case is handled correctly? Maybe it's fine; we'd have to assume
the backup preserves both the mysql.gtid_slave_pos table and the relay log,
not sure.

In any case, I think it is reasonable to add changing the current GTID
position as one case that causes re-initialization of the relay logs, and it
will avoid one set of tricky problems.

> While this matches CHANGE MASTER relay_log_file/pos, those who are used to
> START SLAVE deleting the relay log may not expect the slave to redownload
> any event groups earlier than the relay log contents.
> 
> We may need additional defence against this fringe scenario.
> 
> In comparison, CHANGE MASTER relay_log_file/pos errors if the position
> lies outside of the existing relay log.


> IO Thread (Gtid_IO_Pos)
> 
> Reusing the existing relay log means Gtid_IO_Pos will likely resume
> differently than Gtid_Slave_Pos.
> 
> Yet, as of this draft, the only way to change Gtid_IO_Pos – which may
> require invalidating the relay log – is through RESET SLAVE, which rewinds
> it to Gtid_Slave_Pos as a necessity for deleting the relay log.

Isn't Gtid_IO_Pos effectively a name for the GTID position corresponding to
the end of the relay log? So the Gtid_IO_Pos is changed by anything that
changes the relay log, and nothing else.

One thing not mentioned is some kind of relay log recovery, to make it
consistent when restarting after a crash. It could contain partial event
groups or even partial events. Maybe we can just truncate it to the last
full event group? This must probably be done at server startup, before any
SQL thread would start to read the to-be-truncated part.
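The truncation idea could look something like this sketch (hypothetical event model, not the real relay-log format): scan forward remembering the end of the last complete event group, then truncate everything after it.

```python
# Hypothetical sketch of relay-log recovery at server startup. The event
# parsing is a stand-in: each event is (offset, kind), with kind 'gtid'
# (group start), 'data', or 'commit' (group end).

def truncate_to_last_full_group(events):
    """Return the offset of the final event of the last *complete* event
    group; everything after that offset would be truncated away."""
    truncate_at = 0
    in_group = False
    for offset, kind in events:
        if kind == 'gtid':
            in_group = True
        elif kind == 'commit' and in_group:
            in_group = False
            truncate_at = offset  # end of a complete group
    return truncate_at

events = [(10, 'gtid'), (20, 'data'), (30, 'commit'),
          (40, 'gtid'), (50, 'data')]       # trailing partial group
print(truncate_to_last_full_group(events))  # 30
```

The trailing partial group starting at offset 40 would be discarded and re-fetched from the master.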

Or maybe start from the end of the last full event group in the relay log,
and compare events received from the master after that with the events in
any following partial event group, and complain if they differ. This would
be a more robust protection against getting half an event group from one
master and combining it with another half of a different event group from
another master; and it could get rid of all the ugly stuff related to
Master_info::events_queued_since_last_gtid.

> A proper way to configure Gtid_IO_Pos per IO thread on the main branch (if
> this project targets LTS) may be desirable for fine control equivalent to
> CHANGE MASTER master_log_file/pos.

I don't understand this, as I thought the Gtid_IO_Pos is a property of the
relaylog (just like @@gtid_binlog_pos is a property of the binlog), not
something that can be meaningfully configured?

> It could also remedy the aforementioned gap for the SQL thread.
> 
> Regarding CHANGE MASTER
> 
> This project is not about how CHANGE MASTER also deletes the relay log,
> even for connectivity tunings such as SSL configurations and the retry
> interval. This project warns early, though, that there are also components
> that depend on this limitation, including this project (e.g., so it
> doesn’t have to defend against master_use_gtid changes).
> 
> How can this PR be tested?
> 
> The new MTR test rpl.rpl_gtid_restart_slave demonstrates MDEV-4698’s
> problem at the surface. Existing tests cover the in-depth concerns around
> removing this anti-feature, namely rpl.rpl_gtid_crash for crash recovery.

That's as far as I came with comments for now, there's plenty already.

 - Kristian.
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]
