Hi Kristian, Thanks for the review! I'll be taking over for Monty to address/incorporate your comments, see my responses below.
On Mon, Jul 22, 2024 at 8:02 AM Kristian Nielsen via developers < developers@lists.mariadb.org> wrote: > Hi Monty, > > Here is my review of the MDEV-33856 patch: > > > commit cc516f4d9913d732065c7abef983df42e5a8bab3 > (origin/bb-11.6-MDEV-33856-seconds_behind_master) > > Author: Monty <mo...@mariadb.org> > > Date: Tue May 14 23:47:59 2024 +0300 > > > > MDEV-33856 New definition for Seconds_Behind_Master > > If I understand the patch correctly, in introduces a new > `Master_slave_time_diff` value as an alternative to the traditional > Seconds_behind_master, with a clearer semantics. And does not change how > Seconds_behind_master works. > > I think this is fine, and can help to preserve backwards compatibility. > Just > maybe mention this in the commit message, and/or change the bug title to > "Alternative definition for Seconds_Behind_Master". > > Sure, will do. > > > diff --git a/mysql-test/suite/perfschema/t/relaylog.test > b/mysql-test/suite/perfschema/t/relaylog.test > > index 12fc96a8b27..16f72efae98 100644 > > --- a/mysql-test/suite/perfschema/t/relaylog.test > > +++ b/mysql-test/suite/perfschema/t/relaylog.test > > @@ -43,7 +43,7 @@ select > > if (sum_number_of_bytes_read > 0, "MANY", "NONE") as > SUM_NUMBER_OF_BYTES_READ, > > if (sum_number_of_bytes_write > 0, "MANY", "NONE") as > SUM_NUMBER_OF_BYTES_WRITE > > from performance_schema.file_summary_by_instance > > - where file_name like "%master-%" order by file_name; > > + where file_name like "%master-%" and file_name not like "%sql/share%" > order by file_name; > > > > select * from performance_schema.file_summary_by_instance > > where file_name like "%slave-%" order by file_name; > > @@ -96,7 +96,7 @@ sync_slave_with_master; > > -- echo "============ Performance schema on slave ============" > > > > select * from performance_schema.file_summary_by_instance > > - where file_name like "%master-%" order by file_name; > > + where file_name like "%master-%" and file_name not like "%sql/share%" > order by file_name; > > > > select > > substring(file_name, locate("slave-", file_name)) as FILE_NAME, > > What is the reason for these changes (adding match on "%sql/share%" ? > I don't understand the relation to the subject of the patch. > > I'm not immediately sure about this one. I'll investigate, and remove if it is proven unnecessary. > > > diff --git a/mysql-test/suite/rpl/t/master_last_event_time.inc > b/mysql-test/suite/rpl/t/master_last_event_time.inc > > new file mode 100644 > > index 00000000000..0790d557a7b > > --- /dev/null > > +++ b/mysql-test/suite/rpl/t/master_last_event_time.inc > > > > +--connection server_3 > > +--sleep 1 > > +--source include/start_slave.inc > > +--source include/sync_with_master_gtid.inc > > What is the purpose of --sleep 1 here? > > Usually it's wrong to have sleeps in the tests. Sometimes it is correct, > but > then there should be a comment explaining why a sleep is necessary (and > sufficient) here. This sleep looks wrong, since there's a proper wait > immediately after for the slave to sync up. > > Indeed this is probably left-over from something prior, I'll remove it. > > > diff --git a/mysql-test/suite/rpl/t/master_last_event_time_stmt.test > b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test > > new file mode 100644 > > index 00000000000..fe0f16a3db3 > > --- /dev/null > > +++ b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test > > @@ -0,0 +1,84 @@ > > +# > > +# Statement specific tests for master_last_event_time > > +# > > +--source include/have_binlog_format_statement.inc > > > > +# Slave will replicate the above with a row event which will be very > fast > > +# compared to the master event. > > I don't understand this comment. Why will the slave replicate using a row > event, when the test case sets binlog_format=statement? > > This whole check should go in the *_row.test version. I'll move it. > > > diff --git a/sql/log.cc b/sql/log.cc > > index 1cc0fa77f08..63d9c756911 100644 > > --- a/sql/log.cc > > +++ b/sql/log.cc > > @@ -1898,10 +1898,20 @@ static inline int > > binlog_commit_flush_xid_caches(THD *thd, binlog_cache_mngr *cache_mngr, > > bool all, my_xid xid) > > > Xid_log_event end_evt(thd, xid, TRUE); > > - return (binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, > TRUE)); > > + if (!thd->slave_thread && ! thd->user_time.val) > > + { > > + /* > > + Ensure that on the master the event time is the time of commit, > > + not the start of statement time. > > + */ > > + my_hrtime_t hrtime= my_hrtime(); > > + end_evt.when= hrtime_to_my_time(hrtime); > > + } > > + DBUG_RETURN(binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, > TRUE)); > > Ok, so correct me if I'm wrong, but I think the purpose here is: You want > to > have the time when the transaction ended (in case of long-running query). > But for auto-committed query, there is no explicit COMMIT query, so the > thd->start_time is the start of the transaction. So here you ensure that > the > XID_EVENT will have the time at start of commit in either case. If this is > correct, probably could use a comment explaining this. > Right, will do. > > Some comments: > > 1. It is better to check for thd->rgi_slave being non-NULL than for > thd->slave_thread. The thd->rgi_slave is associated with the replication of > some event or transaction. The thd->slave_thread just identifies the > thread/connection. (This is mostly style/code clarity, as in current code > the two values would track each other.) > Sure. > > 2. Why the condition on thd->user_time.val? What is the situation where we > are not replicating an event, but user_time is set and we do not want to > put > the current time in the XID_EVENT? When SET TIMESTAMP is set, the XID_EVENT should use the user value, rather than the current time (as all other events of the transaction will use the SET TIMESTAMP value). > > 3. This handles an XID end event, which is for InnoDB transactions. But > MyISAM/Aria use a different end event, a COMMIT query, and there are are > few > other cases as well. Don't you want consistently to have the end of the > query as the time in all end events? Then this code should probably instead > be in binlog_flush_cache() or perhaps write_transaction_to_binlog()? > Hmm, but I think that would break your point of (4) where there are Query_log_events without end events. So if we tag their `when` to be at binlog time, then when we calculate the end of the Query_log_event on the slave, it would be when + exec_time, which could put us significantly past the real end time of the event. So I think the XID and Query_log events should be treated differently. > > 4. Even given (3), there are still cases (eg. DDL) where there is no end > event, and thus there is no event containing the time for the end of the > operation on the master. Won't this cause some inconsistency in how the > slave determines and displays its timings with this patch? If not, why not? > And to elaborate on the response of (3) for consistency, the Query_log_event's `when` uses thd->start_time, even for the end event. So the end Query_log_event for say a MyISAM/Aria transaction will be consistent with the single Query_log_event of an event without an explicit ending transaction. > > > > for (; has_xid && !entry.need_unlog && ha_info; ha_info= > ha_info->next()) > > { > > if (ha_info->is_started() && ha_info->ht() != binlog_hton && > > !ha_info->ht()->commit_checkpoint_request) > > + { > > entry.need_unlog= true; > > + break; > > + } > > } > > Agree, this code always irritated me ;-) > > You could also move the condition (!entry.need_unlog && ha_info) out of the > loop in an outer if () to not have to check it on every loop iteration > (just > an observation, entirely up to you, it's not something your patch > introduced). > > > > diff --git a/sql/log_event.cc b/sql/log_event.cc > > index 3802f76efe0..706a4903eb0 100644 > > --- a/sql/log_event.cc > > +++ b/sql/log_event.cc > > > @@ -1104,10 +1116,12 @@ Log_event* Log_event::read_log_event(const uchar > *buf, uint event_len, > > switch(event_type) { > > case QUERY_EVENT: > > ev= new Query_log_event(buf, event_len, fdle, QUERY_EVENT); > > + set_orig_exec_time_in_thd(((Query_log_event*) ev)->exec_time); > > break; > > case QUERY_COMPRESSED_EVENT: > > ev= new Query_compressed_log_event(buf, event_len, fdle, > > - QUERY_COMPRESSED_EVENT); > > + QUERY_COMPRESSED_EVENT); > > + set_orig_exec_time_in_thd(((Query_compressed_log_event*) > ev)->exec_time); > > > @@ -964,6 +964,18 @@ Log_event* Log_event::read_log_event(IO_CACHE* file, > > DBUG_RETURN(res); > > } > > > > +/* > > + Update thd->orig_exec_time > > +*/ > > + > > +inline void set_orig_exec_time_in_thd(my_time_t exec_time) > > +{ > > +#if !defined(MYSQL_CLIENT) && defined(HAVE_REPLICATION) > > + THD *thd= current_thd; > > + if (likely(thd)) > > + thd->orig_exec_time= exec_time; > > +#endif > > +} > > Using current_thd doesn't seem right here, and in fact it's not, as seen by > this part of the patch where thd->orig_exec_time has to be set again for > parallel replication: > > > diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc > > index 51ca3b183ec..d279761767a 100644 > > --- a/sql/rpl_parallel.cc > > +++ b/sql/rpl_parallel.cc > > @@ -60,7 +60,10 @@ rpt_handle_event(rpl_parallel_thread::queued_event > *qev, > > rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos; > > strcpy(rgi->future_event_master_log_name, > qev->future_event_master_log_name); > > if (event_can_update_last_master_timestamp(ev)) > > + { > > rgi->last_master_timestamp= ev->when + ev->exec_time; > > + thd->orig_exec_time= ev->exec_time; > > + } > > So let's not set thd->orig_exec_time from Log_event::read_log_event(), it's > redundant (for parallel replication), and logically setting it should be > done when applying the event, not when reading the event. > > Instead, set the thd->orig_exec_time in exec_relay_log_event(), so we > retain > similarity with how replication is done in the parallel and non-parallel > case. There is already a similar if > (event_can_update_last_master_timestamp(ev)) > in exec_relay_log_event(). > > (Also see comments below about moving this from THD to rpl_group_info to > not > increase THD just for this very specific case). > Sure, makes sense (to also use RGI). > > > > + > > +/* > > + Get the time when the event had been executed on the master. > > + This works for both query events and load data events. > > +*/ > > + > > +#if Q_EXEC_TIME_OFFSET != L_EXEC_TIME_OFFSET > > +#error "Q_EXEC_TIME_OFFSET is not same as L_EXEC_TIME_OFFSET" > > +#endif > > + > > +time_t query_event_get_time(const uchar *buf, > > + const Format_description_log_event > > + *description_event) > > +{ > > + time_t when= uint4korr(buf); > > + buf+= description_event->common_header_len; > > + return when + uint4korr(buf + Q_EXEC_TIME_OFFSET); > > Suggestion: Add an assert here: > > DBUG_ASSERT(LOG_EVENT_IS_QUERY(buf[EVENT_TYPE_OFFSET]) || > LOG_EVENT_IS_LOAD_DATA(buf[EVENT_TYPE_OFFSET])); > > just to avoid someone mistakenly adding code later to call this for a wrong > event type. > Sure, good idea. > > > > diff --git a/sql/sql_class.h b/sql/sql_class.h > > index 2d671139a6c..05c9c4e8012 100644 > > --- a/sql/sql_class.h > > +++ b/sql/sql_class.h > > @@ -3168,6 +3168,7 @@ class THD: public THD_count, /* this must be first > */ > > ulonglong start_utime, utime_after_lock, utime_after_query; > > /* This can be used by handlers to send signals to the SQL level */ > > ulonglong replication_flags; > > + my_time_t orig_exec_time; // Exec time for last read binlog > event > > > @@ -1348,8 +1346,14 @@ Query_log_event::Query_log_event(THD* thd_arg, > const char* query_arg, > > memset(&host, 0, sizeof(host)); > > error_code= errcode; > > > > - end_time= my_time(0); > > - exec_time = (ulong) (end_time - thd_arg->start_time); > > + /* > > + For slave threads, remember the original master exec time. > > + This is needed to be able to calculate the master commit time. > > + */ > > + exec_time= ((thd->slave_thread) ? > > + thd->orig_exec_time : > > + (my_time(0) - thd_arg->start_time)); > > + > > I think we can here avoid increasing the THD size, which is sad for a > special case like this. Instead put the orig_exec_time into THD::rgi_slave, > extending the struct rpl_group_info. > > Then the assignment above will be to rgi->orig_exec_time instead of > thd->orig_exec_time, and the check here will be for thd->rgi_slave instead > of thd->slave_thread. The rgi is the appropriate place for orig_exec_time, > as it is a property of the transaction/event group being replicated. > > > @@ -3171,146 +2856,170 @@ static const LEX_CSTRING msg_ignored= { > STRING_WITH_LEN("Ignored") }; > #endif > > > > -static bool send_show_master_info_data(THD *thd, Master_info *mi, bool > full, > > - String *gtid_pos) > > +inline void store_string_or_null(Field **field, const char *str) > > +{ > > > > - mysql_mutex_unlock(&mi->run_lock); > > +void store_master_info(THD *thd, Master_info *mi, TABLE *table, > > > > - protocol->store_string_or_null(mi->host, &my_charset_bin); > > - protocol->store_string_or_null(mi->user, &my_charset_bin); > > - protocol->store((uint32) mi->port); > > - protocol->store((uint32) mi->connect_retry); > > - protocol->store(mi->master_log_name, strlen(mi->master_log_name), > > + store_string_or_null(field++, mi->host); > > + store_string_or_null(field++, mi->user); > > + (*field++)->store((uint32) mi->port); > > + (*field++)->store((uint32) mi->connect_retry); > > + (*field++)->store(mi->master_log_name, strlen(mi->master_log_name), > > So this patch refactors the code for SHOW SLAVE STATUS to use a shared code > path with information_schema tables? That sounds like a good change. > But for the future, please do such refactoring as a separate patch, you can > see my patch series of pre-computing binlog checksums (MDEV-31273) for an > example. Doing it together makes it very hard to review the actual code > changes and see what is refactoring and what is new logic for > master_slave_time_diff. > > > > + if (unlikely(!rli->slave_timestamp) && > Log_event::is_group_event(typ)) > > + { > > + /* > > + First event for this slave. Assume that all the slave was up > to date > > + with the master just before the current event. > > + */ > > + rli->slave_timestamp= (time_t) ev->when + (time_t) > ev->exec_time-1; > > I think the comment is meant to explain why there is the "-1" / minus-one > in > this expression, but it's not really clear exactly what is meant, so > perhaps > elaborate it a bit. > > Does it mean that master_slave_time_diff will start out as "1 second" just > after the slave restarts? > > One way or the other, I do not have a strong opinion. But I want to give a > general remark, from my experience with the troubles over the years with > Seconds_Behind_Master. The problem with Seconds_Behind_Master was always > that it didn't quite do what someone expected in some particular case, and > then a new special case was added to try to accomodate. And after a number > of iterations of this, we ended up with a semantics for > Seconds_Behind_Master that no-one _really_ understands, and which still > doesn't quite always do what people expect (which is in any case impossible > if different people have different expectations). > > This "-1" here is clearly on the same road towards adding special cases. So > my strong advice would be that, going forward, be _extremely_ conservative > about modifying the semantics of master_slave_time_diff. The one good thing > about the new status is that it has a (hopefully) clear semantics that can > be documented (now including this "-1") and completely understood by users. > Let's stick to that no matter who comes and wants just another small tweak, > unless such tweaks actually make the semantics and documentation _simpler_. > > > > + /* > > + replay_log.description_event_for_exec can be null if the slave > thread > > + is getting killed > > + */ > > + if (LOG_EVENT_IS_QUERY((Log_event_type) buf[EVENT_TYPE_OFFSET]) || > > + LOG_EVENT_IS_LOAD_DATA((Log_event_type) buf[EVENT_TYPE_OFFSET])) > > + { > > + time_t exec_time= query_event_get_time(buf, rli->relay_log. > > + > description_event_for_queue); > > + set_if_bigger(rli->newest_master_timestamp, exec_time); > > + } > > This confused me, because in Query_log_event, the exec_time is the duration > of the query (a time interval), not an absolute time. > > So please use the name "end_time" for the variable instead of "exec_time", > and consider renaming query_event_get_time() to query_event_get_end_time(). > Right, makes sense. > > > > > +static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE > *schema_table) > > +{ > > + ST_FIELD_INFO *field_info= schema_table->fields_info; > > + Name_resolution_context *context= > &thd->lex->first_select_lex()->context; > > + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_SHOW_SLAVE_STAT); > > + bool all_slaves= thd->lex->mi.show_all_slaves; > > + ulonglong used_fields= ~0ULL; > > + > > + if (!all_slaves) > > + { > > + /* Remove 2 first fields and all fields above and including field > 55 */ > > + used_fields&= ~((1ULL << 0) | (1ULL << 1)); > > + used_fields&= ((1ULL << 56)-1); > > Should there be some constant used for the magic numbers 55/56? > And/or an assertion that this value is correct? > That'd make sense, sure. > > Also, is the intention here that we will from here on never add more fields > to legacy SHOW SLAVE STATUS? If so, there should probably be a comment > about > this; also see the test case > mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that > might need updating. > Not that I know of, why wouldn't we want to keep them in-sync? > > - Kristian > _______________________________________________ > developers mailing list -- developers@lists.mariadb.org > To unsubscribe send an email to developers-le...@lists.mariadb.org >
_______________________________________________ developers mailing list -- developers@lists.mariadb.org To unsubscribe send an email to developers-le...@lists.mariadb.org