Jimmy Hú <[email protected]> writes: > As you requested, I have written a design description for my MDEV-4698 > <https://jira.mariadb.org/browse/MDEV-4698>/MDEV-33645 > <https://jira.mariadb.org/browse/MDEV-33645> project so far.
> Please let me know your thoughts, especially on how well this plan fits in
> the big picture.

Thanks, Jimmy, for the design writeup. I appreciate the effort you have spent to understand the many issues around this project.

The design writeup is closely tied to the details of the existing replication code and how it should be modified. I think starting from the code is a good way; it is also how I do large projects, since it is necessary to start somewhere. But at some point it is necessary to take a step back and consider the design and architecture from a higher-level point of view: describe the problem and the various possible solutions independent of the current code, discuss the pros and cons of each, and explain the reasons for choosing a particular one.

I read the description several times, and I found it hard to understand the high-level picture from it. This is not personal criticism; on the contrary, I know very well how hard it is to understand the replication code and how to extend it with new development, and it seems a lot of good progress has been made here. I am just trying to explain the overall reasoning behind my comments. I think the correct design and understanding of all the details is something like 80% of the work in this task.

First, what is the overall goal of this project? I can think of two possibilities:

1. Performance optimisation. When the slave restarts, avoid re-downloading binlog data from the master, saving master load and network bandwidth. If this is the goal, then preserving relay logs can be best-effort, and any corner case where this might be unsafe or inconvenient can always fall back to discarding the existing relay logs and re-downloading from the master starting at Gtid_Slave_Pos.

2. Safe copy of master binlogs. So that once a transaction is in the relay log on a slave, it has some kind of guarantee to be available on that slave even if the binlog on the master disappears, or the master itself does.
   This requires clear and well-documented rules for exactly when relay logs will be preserved and when they will be deleted. So ad-hoc fallbacks in corner cases will no longer be an easy way to solve tricky details.

So which of these two is it for this project?

Next, the overall approach for locating the GTID.

3. The "start from scratch" case, with @@relay_log_recovery and the like. I assume that in this case the existing behaviour is to be preserved 100%: the exact same rules will be used for how a GTID start position is handled in all the different possible scenarios. Only internally might there be some refactoring to share code, but the logic will be unchanged. Is this correctly understood?

I think this is important to make clear, because the main requirement for a change like this is that the behaviour must be 110% the same between the case where the relay log is deleted and re-downloaded from scratch, and the case where the slave continues from the existing relay logs.

Thus, it will be a necessary part of this project to explicitly list the full rules for how GTID connect works. This is not trivial; there are complex rules and a number of different cases. For example:

 - The rules for correctly handling out-of-order sequence numbers between different server IDs.
 - The special rules for when a domain that is not found in the master's binlog results in an error, and when it results in that domain being ignored.
 - The subtly different semantics between gtid_strict_mode 0 and 1.
 - START SLAVE UNTIL master_gtid_pos is currently handled mostly on the master side. How will this be handled when continuing from existing relay logs?
 - What about the semantics for skipping event groups, which IIRC can sometimes happen on the master?
 - ... etc. What is the full set of rules?
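To make concrete the kind of rule set I mean, here is a toy sketch (Python, not server code) of per-domain start-position filtering. The `DomainFilter` class and its skip rule are my own simplification for illustration, not MariaDB's actual logic; the out-of-order seq_no comment shows exactly where the simple rule breaks down:

```python
# Toy model: a GTID is (domain_id, server_id, seq_no); a start position
# holds at most one GTID per domain.  Within one domain, event groups up
# to and including the start GTID must be skipped -- but different
# server_ids may produce out-of-order seq_nos, which is why the real
# rules are subtler than a plain seq_no comparison.

class DomainFilter:
    """Skip event groups in one domain until the start GTID is passed."""

    def __init__(self, start_server_id, start_seq_no):
        self.start_server_id = start_server_id
        self.start_seq_no = start_seq_no
        self.reached = False  # have we seen the exact starting GTID yet?

    def should_skip(self, server_id, seq_no):
        if self.reached:
            return False          # past the start point: apply everything
        if server_id == self.start_server_id and seq_no == self.start_seq_no:
            self.reached = True   # the start GTID itself: skip it,
            return True           # apply everything after it
        # Before the start GTID is seen, skip groups with lower or equal
        # seq_no.  With out-of-order seq_nos between server_ids, this
        # naive rule is exactly the kind of case the full rule set must
        # pin down explicitly.
        return seq_no <= self.start_seq_no
```

E.g. with a start position of 0-1-100, `DomainFilter(1, 100)` skips 0-1-99 and 0-1-100 itself, then applies 0-1-101 and everything after.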
I think the confidence in the correctness of this change will to a large extent be determined by the quality of the description of what these rules are, and by how the design ensures that the rules will be the same in the new mode where existing relay logs are re-used.

It is clear that removing this limitation of deleting relay logs in GTID mode would be a very nice improvement. It will also be nice to get rid of the tricky special cases in the code that try to partially handle the issue when only one of the IO or SQL threads is stopped and restarted; this code is complex and error-prone. The challenge is to do this correctly.

4. The IO thread. From the description, my understanding is that the IO thread will now have two cases:

(a) If there are no relay logs, connect to the master at GTID position Gtid_Slave_Pos, as before.

(b) If continuing from existing relay logs, the IO thread will connect from the position in the existing relay logs that corresponds to @@gtid_binlog_pos of the binlogs.

The scary part here is how to be sure that stitching together two separate streams from the master will be done correctly in all cases. Some things make this harder, when things change from outside the server without the slave code being aware. For example, after a mariadbd restart the configuration may have changed. If the changes affect the contents of the relay log in any way, the change could take effect at a random point in the replication stream, which must be avoided. Or what if DNS changes and the master server is now a different one? Or if master.info was modified manually? All the possible issues around such cases must be understood and documented, regardless of whether they are to be handled automatically or documented as limitations.
The complexity around such issues is the major reason for the original limitation of deleting the relay logs; it is hard for the replication code to be robust against all the possible changes that can happen between server stop and server start.

5. The SQL thread. What is the overall plan for how to start the SQL thread (when using GTID)? It was not clear to me from the description.

To me, the main idea of a project like this should be to implement, on the slave side in the SQL driver thread, the logic for how to start in the relay log at the current Gtid_Slave_Pos. This mechanism only exists on the master side in the current code. And with this in place, can't the SQL driver thread simply start reading events from the relay log, skipping any that have not yet reached their start GTID, or that _have_ reached their UNTIL position? Then there should be no more need for any code that differs between parallel and non-parallel replication, nor any special handling for skipping events that have not yet reached their position, because all the logic can be in the shared code, prior to calling rpl_parallel::do_event() or apply_event_and_update_pos(). Is that what you are planning to do, or is there some reason this is not possible?

Now, a more specific question. One of the issues to solve is what happens after a crash, where part of the (end of the) relay log may be lost, or even after a restart where it is intact. How can the IO thread determine where to continue? The start of the binlog was from some Gtid_Slave_Pos, where the master filters out events that are prior to each domain's starting point. But after a restart, this original starting GTID position is no longer known. And the current relay log may have reached the starting position in one domain but not in the other.

I.e. suppose we start a new relay log from 0-1-100,1-2-100. We receive up to binlog state 0-1-150,1-2-50. Then we restart the slave.
Do we reconnect to the master at the current relay log position of 0-1-150,1-2-50? Then from this point on, we will have extra events in the relay log that would not have been there without the restart, and the SQL thread will now have the task of skipping them based on Gtid_Slave_Pos.

Or do we somehow try to merge Gtid_Slave_Pos into the current relay log state, so we can reconnect to the master at 0-1-150,1-2-100?

Or should we perhaps change things so that the skipping happens entirely on the slave side? So that the initial slave connect will have the master's dump thread send _all_ events starting from somewhere at or before the event 0-1-100 (say 0-1-90, for example), and all the extra events will be skipped by the SQL thread. This way the logic would be in one place and would be the same regardless of slave reconnects or restarts.

Next, on to detailed comments:

> Fixes MDEV-4698
> Resolves MDEV-33645
>
> Note
>
> Gtid_Slave_Pos in this description refers to @@gtid_slave_pos in
> master_use_gtid=slave_pos and @@gtid_current_pos in
> master_use_gtid=current_pos.
>
> In GTID replication, START SLAVE deletes the relay log if neither IO nor
> SQL threads are running. While this was an intentional design for crash
> compatibility, it places an unnecessary requirement to redownload the
> relay log from the master.
>
> The goal of this project is to lift this limitation for all cases (except
> explicit crash recovery, @@relay_log_recovery).
> It is doable on 10.11 with no major refactorings, though I'm open to
> rebasing to the main branch if we deem these changes in behaviour or code
> too impactful.

I *strongly* disagree with doing anything like this in a stable release, much less old 10.11.

> @@relay_log_recovery, RESET SLAVE, and CHANGE MASTER (which covers
> changing MASTER_USE_GTID) continue to delete the relay log and rewind
> Gtid_IO_Pos to Gtid_Slave_Pos.

Sounds reasonable.
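Coming back to the restart example above: the per-domain "merge" option can be sketched like this (a toy model, not actual server code; positions modelled as `{domain_id: (server_id, seq_no)}`, and comparing by seq_no alone is itself a simplification given out-of-order seq_nos between server_ids):

```python
# Toy sketch of the "merge" reconnect option: per domain, take the
# further-ahead of the relay log's current state and Gtid_Slave_Pos.

def merge_reconnect_pos(relay_log_state, gtid_slave_pos):
    merged = dict(relay_log_state)
    for domain, (server_id, seq_no) in gtid_slave_pos.items():
        have = merged.get(domain)
        # Simplified comparison: with out-of-order seq_nos between
        # server_ids the real rule would have to be subtler.
        if have is None or seq_no > have[1]:
            merged[domain] = (server_id, seq_no)
    return merged

# With the numbers from the restart example in the text:
relay_state = {0: (1, 150), 1: (2, 50)}    # received up to 0-1-150,1-2-50
slave_pos   = {0: (1, 100), 1: (2, 100)}   # originally started from 0-1-100,1-2-100
print(merge_reconnect_pos(relay_state, slave_pos))
# -> {0: (1, 150), 1: (2, 100)}, i.e. reconnect at 0-1-150,1-2-100
```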
> The majority of the project, however, is filling in the gaps in position
> recovery mechanisms for both the IO and SQL threads. They have not only
> been neglected, but have also relied on this very limitation. Once the
> threads restore their states, they can continue from where they left off
> as if they paused normally.

Right. However, as I tried to explain above, the project needs to design a proper mechanism for continuing from where the threads left off, and implement that. Modifying the existing code piece by piece will not end up with a clean design, I fear.

> IO Thread
>
> The IO Thread sends Gtid_IO_Pos to the master, which then omits event
> groups before this position for the slave. Unlike Gtid_Slave_Pos, however,
> this position is not persisted. As persistence (in a .state file?) would
> require trade-offs between performance and crash safety during operation,
> reloading it by scanning the relay log is preferable, at least as a
> capability.
>
> Notably, @@gtid_binlog_pos already has this procedure for when the
> log-bin.state file is unusable.
>
> The (initialization of) relay logs can leverage it – minus the TC-specific
> steps – to restore the GTID position at the appending end, including the
> event count in a partial most-recent group,
> Master_info::events_queued_since_last_gtid.
>
> Though a missing puzzle piece for scanning the relay log is that, unless
> the startup scans every GTID between Gtid_Slave_Pos and the appending end,
> the scan would need a position list event to be certain about all domains
> within this portion.
>
> Relay logs' GTID Lists currently mostly only come from the master, however.
>
> Side Effect: For convenience, I enabled header GTID Lists in relay logs to
> match the master binary logs.
>
> This guaranteed position list enables the procedure to only scan the last
> log file for both binary and relay logs.

Agree that having a guaranteed GTID list in the relay log files sounds right.
Do you understand the difference between a GTID position (as in @@gtid_slave_pos) and a GTID state (as in the GTID_LIST event)? (I assume you do, but want to confirm, as this point is so essential.)

The GTID state for the GTID_LIST event should be easy to maintain; it's just updating the starting GTID state with each GTID as it is seen. It requires that the master sends the starting GTID state on slave connect (I recall that it sends a fake GTID_LIST, but don't recall if this is a GTID state or just a GTID position). I'm not 100% sure of the implications when events are skipped master-side, if that is a concern, since it will make the master-side/binlog and slave-side/relaylog GTID states out of sync.

And hopefully the SQL driver thread would then _always_ use the GTID search procedure for where to start (in GTID mode). Indeed, it sounds like this will require having GTID indexes on the relay logs. But I think the code in gtid_index.cc should be directly usable for this; it needs to be checked, though.

> The scanning procedure will fail out of caution if the relay log doesn't
> have a GTID List anywhere in this file, and the position reload falls back
> to @@relay_log_recovery to perform "the current behaviour" one last time.

So this sounds like we are implementing my point (1) above, and not my point (2)?

> Room for Improvement: This header GTID List and the ability to restore
> GTID position bring the Relay Log closer to the master-side Binary Log, so
> it'd be worth consolidating their (and possibly also mariadb-binlog's)
> unique but equivalent code, such as @@gtid_binlog_pos & Gtid_IO_Pos, and
> the position-aware event reading portions of the dump thread's and SQL
> thread's loops.
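To illustrate the position/state distinction I asked about above, here is a toy model (not the server's actual data structures): a GTID *position* keeps one GTID per domain, while a GTID *state* (as in GTID_LIST) keeps the last GTID per (domain, server_id) pair, so it can describe a binlog that interleaves several server_ids within one domain:

```python
# Toy model of GTID position vs. GTID state.  A GTID is
# (domain_id, server_id, seq_no).

def update_position(pos, gtid):
    """A position keeps only the last GTID per domain."""
    domain, server_id, seq_no = gtid
    pos[domain] = (server_id, seq_no)
    return pos

def update_state(state, gtid):
    """A state keeps the last seq_no per (domain, server_id) pair."""
    domain, server_id, seq_no = gtid
    state[(domain, server_id)] = seq_no
    return state

pos, state = {}, {}
for gtid in [(0, 1, 10), (0, 2, 11), (0, 1, 12)]:
    update_position(pos, gtid)
    update_state(state, gtid)

print(pos)    # {0: (1, 12)}             -- one entry per domain
print(state)  # {(0, 1): 12, (0, 2): 11} -- one entry per (domain, server_id)
```

The state remembers that server 2 reached 0-2-11 even after server 1's later 0-1-12; the position has forgotten it.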
> SQL Thread
>
> As the GTID system was built on top of the file-position system, the SQL
> thread still depends on relay_log_file/pos even in GTID mode, but for
> performance, it does not save this file-position to @@relay_log_info_file
> every transaction (since START SLAVE will delete the relay log anyways).
>
> This dependence is the opposite of MDEV-37584's goal of leveraging
> Gtid_Slave_Pos so we don't have to keep @@relay_log_info_file
> synchronized.
>
> In addition, since turning parallel replication off no longer deletes the
> relay log (by START SLAVE), serial START SLAVE SQL_THREAD now also needs
> to prepare for (multi-domain) parallel replication stopped with worker
> threads at different positions, some possibly earlier than the one that
> goes into @@relay_log_info_file.

Indeed. The slave server could be stopped, --slave-parallel-threads changed to 0, and restarted, and the slave would have changed from parallel to non-parallel replication without any way for the code to know the prior state. But won't this be handled uniformly by both modes, when the skipping of events before each event's starting position happens in the common part of the SQL driver thread, in exec_relay_log_event() maybe?

> All of these mean that, until parallel and serial modes merge their code,
> I must port parallel replication's ability to skip already-executed
> groups, as if turning @@gtid_ignore_duplicates on for this duration.
>
> (Non-group events such as Format Description and Rotate cannot be skipped.)
>
> After multiple iterations that caused side effects, I settled on turning
> parallel replication's Relay_log_info::gtid_skip_flag into yet another
> skip-event flag. This flag is then updated and checked in serial-specific
> code only, to avoid conflicting with parallel replication (workers).
> (Here, it's updated at GTID events only, but it shouldn't hurt to update
> it at group endings as well, which (theoretically) can utilize
> Relay_log_info::is_in_group().)

I don't really follow this explanation. It sounds like this is trying too hard to re-use existing (ugly) code in parallel replication related to restarting the SQL thread while the IO thread keeps running. That code can hopefully be discarded once this project is implemented?

> In multi-domain replication, the SQL thread must first rewind its position
> to return to the workers' positions during its natural flow, in case
> parallel workers stopped at different positions.
>
> Serial replication also has this requirement, in case the processing was
> previously in parallel.
>
> Side Effect & Limitation: Without any indexing on the relay log (do we
> want this runtime tax?), the SQL thread has to rewind to the beginning of
> the (available) relay log.
>
> Room for Improvement: We can make use of the newly guaranteed header GTID
> Lists to scan backward or binary search instead, especially if
> @@relay_log_purge=OFF, but this capability is better as a MYSQL_BIN_LOG
> function extracted from the binlog dump thread's main loop.
>
> This procedure exists in STOP SLAVE for parallel replication; I moved it
> to START SLAVE for both parallel and serial replication, so not only will
> this preparation account for any changes in Gtid_Slave_Pos, but server
> reboots will also perform this.
>
> This section also includes starting with the skip flag set, in case of a
> relay log switch mid-group. (Can it?)
>
> New open issues around Gtid_Slave_Pos changes
>
> SQL Thread
>
> As of this draft, changing Gtid_Slave_Pos does not invalidate the relay
> logs for all connections, and the SQL threads will directly resume from
> the new position(s) in the existing relay log.

This doesn't sound safe. What if the position gets set backwards?
Then the starting position may be before the start of the relay log in all, or only some, domains. I think it would be reasonable to delete the relay logs when @@gtid_slave_pos is changed.

But what if the server is restored from a backup (which changes @@gtid_slave_pos without the server code being aware of it)? Can we be sure that we can handle this case correctly? Maybe it's fine; we'd have to assume the backup preserves both the mysql.gtid_slave_pos table and the relay log, not sure.

In any case, I think it is reasonable to make changing the current GTID position one of the cases that cause re-initialization of the relay logs; it will avoid one set of tricky problems.

> While this matches CHANGE MASTER relay_log_file/pos, those who are used to
> START SLAVE deleting the relay log may not expect the slave to redownload
> any event groups earlier than the relay log contents.
>
> We may need additional defence against this fringe scenario.
>
> In comparison, CHANGE MASTER relay_log_file/pos errors if the position
> lies outside of the existing relay log.
>
> IO Thread (Gtid_IO_Pos)
>
> Reusing the existing relay log means Gtid_IO_Pos will likely resume
> differently than Gtid_Slave_Pos.
>
> Yet, as of this draft, the only way to change Gtid_IO_Pos – which may
> require invalidating the relay log – is through RESET SLAVE, which rewinds
> it to Gtid_Slave_Pos as a necessity for deleting the relay log.

Isn't Gtid_IO_Pos effectively a name for the GTID position corresponding to the end of the relay log? So Gtid_IO_Pos is changed by anything that changes the relay log, and by nothing else.

One thing not mentioned is some kind of relay log recovery, to make the relay log consistent when restarting after a crash. It could contain partial event groups or even partial events. Maybe we can just truncate it to the last full event group? This must probably be done at server startup, before any SQL thread starts to read the to-be-truncated part.
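A minimal sketch of the truncate-to-last-full-group idea, over a toy event list rather than a real relay log file (the event tags "GTID"/"COMMIT"/"QUERY" are stand-ins; in the real log, group boundaries would come from GTID events and COMMIT/XID events):

```python
# Toy sketch: find where to truncate a relay log so that no partial
# event group remains at the end.

def last_complete_group_end(events):
    """Return the index just past the last event that ends a complete
    group; anything after it is a partial group and would be truncated."""
    end = 0
    for i, ev in enumerate(events):
        if ev == "COMMIT":   # ends the current event group
            end = i + 1
        # "GTID" starts a group and other events keep it open;
        # neither moves the truncation point.
    return end

log = ["GTID", "QUERY", "COMMIT", "GTID", "QUERY"]   # last group is partial
keep = log[:last_complete_group_end(log)]
print(keep)   # ['GTID', 'QUERY', 'COMMIT']
```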
Or maybe start from the end of the last full event group in the relay log, and compare events received from the master after that with the events in any following partial event group, complaining if they differ. This would be a more robust protection against getting half an event group from one master and combining it with another half of a different event group from another master; and it could get rid of all the ugly stuff related to Master_info::events_queued_since_last_gtid.

> A proper way to configure Gtid_IO_Pos per IO thread on the main branch (if
> this project targets LTS) may be desirable for fine control equivalent to
> CHANGE MASTER master_log_file/pos.

I don't understand this, as I thought Gtid_IO_Pos is a property of the relay log (just like @@gtid_binlog_pos is a property of the binlog), not something that can be meaningfully configured?

> It could also remedy the aforementioned gap for the SQL thread.
>
> Regarding CHANGE MASTER
>
> This project is not about how CHANGE MASTER also deletes the relay log,
> even for connectivity tunings such as SSL configurations and the retry
> interval. This project warns early, though, that there are also components
> that depend on this limitation, including this project itself (e.g., so it
> doesn't have to defend against master_use_gtid changes).
>
> How can this PR be tested?
>
> The new MTR test rpl.rpl_gtid_restart_slave demonstrates MDEV-4698's
> problem at the surface. Existing tests cover the in-depth concerns around
> removing this anti-feature, namely rpl.rpl_gtid_crash for crash recovery.

That's as far as I came with comments for now; there's plenty already.

 - Kristian.

_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]
