[MariaDB developers] Re: [MariaDB commits] [PATCH] MDEV-33856 New definition for Seconds_Behind_Master

Brandon Nesterenko via developers Tue, 23 Jul 2024 10:42:14 -0700

Hi Kristian,

Thanks for the review! I'll be taking over for Monty to address/incorporate
your comments, see my responses below.


On Mon, Jul 22, 2024 at 8:02 AM Kristian Nielsen via developers <
developers@lists.mariadb.org> wrote:

> Hi Monty,
>
> Here is my review of the MDEV-33856 patch:
>
> > commit cc516f4d9913d732065c7abef983df42e5a8bab3
> (origin/bb-11.6-MDEV-33856-seconds_behind_master)
> > Author: Monty <mo...@mariadb.org>
> > Date:   Tue May 14 23:47:59 2024 +0300
> >
> >     MDEV-33856 New definition for Seconds_Behind_Master
>
> If I understand the patch correctly, in introduces a new
> `Master_slave_time_diff` value as an alternative to the traditional
> Seconds_behind_master, with a clearer semantics. And does not change how
> Seconds_behind_master works.
>
> I think this is fine, and can help to preserve backwards compatibility.
> Just
> maybe mention this in the commit message, and/or change the bug title to
> "Alternative definition for Seconds_Behind_Master".
>
>
Sure, will do.


>
> > diff --git a/mysql-test/suite/perfschema/t/relaylog.test
> b/mysql-test/suite/perfschema/t/relaylog.test
> > index 12fc96a8b27..16f72efae98 100644
> > --- a/mysql-test/suite/perfschema/t/relaylog.test
> > +++ b/mysql-test/suite/perfschema/t/relaylog.test
> > @@ -43,7 +43,7 @@ select
> >    if (sum_number_of_bytes_read > 0, "MANY", "NONE") as
> SUM_NUMBER_OF_BYTES_READ,
> >    if (sum_number_of_bytes_write > 0, "MANY", "NONE") as
> SUM_NUMBER_OF_BYTES_WRITE
> >    from performance_schema.file_summary_by_instance
> > -  where file_name like "%master-%" order by file_name;
> > +  where file_name like "%master-%" and file_name not like "%sql/share%"
> order by file_name;
> >
> >  select * from performance_schema.file_summary_by_instance
> >    where file_name like "%slave-%" order by file_name;
> > @@ -96,7 +96,7 @@ sync_slave_with_master;
> >  -- echo "============ Performance schema on slave ============"
> >
> >  select * from performance_schema.file_summary_by_instance
> > -  where file_name like "%master-%" order by file_name;
> > +  where file_name like "%master-%" and file_name not like "%sql/share%"
> order by file_name;
> >
> >  select
> >    substring(file_name, locate("slave-", file_name)) as FILE_NAME,
>
> What is the reason for these changes (adding match on "%sql/share%" ?
> I don't understand the relation to the subject of the patch.
>
>
I'm not immediately sure about this one. I'll investigate, and remove if it
is proven
unnecessary.


>
> > diff --git a/mysql-test/suite/rpl/t/master_last_event_time.inc
> b/mysql-test/suite/rpl/t/master_last_event_time.inc
> > new file mode 100644
> > index 00000000000..0790d557a7b
> > --- /dev/null
> > +++ b/mysql-test/suite/rpl/t/master_last_event_time.inc
> >
> > +--connection server_3
> > +--sleep 1
> > +--source include/start_slave.inc
> > +--source include/sync_with_master_gtid.inc
>
> What is the purpose of --sleep 1 here?
>
> Usually it's wrong to have sleeps in the tests. Sometimes it is correct,
> but
> then there should be a comment explaining why a sleep is necessary (and
> sufficient) here. This sleep looks wrong, since there's a proper wait
> immediately after for the slave to sync up.
>
>
Indeed this is probably left-over from something prior, I'll remove it.


>
> > diff --git a/mysql-test/suite/rpl/t/master_last_event_time_stmt.test
> b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test
> > new file mode 100644
> > index 00000000000..fe0f16a3db3
> > --- /dev/null
> > +++ b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test
> > @@ -0,0 +1,84 @@
> > +#
> > +# Statement specific tests for master_last_event_time
> > +#
> > +--source include/have_binlog_format_statement.inc
> >
> > +# Slave will replicate the above with a row event which will be very
> fast
> > +# compared to the master event.
>
> I don't understand this comment. Why will the slave replicate using a row
> event, when the test case sets binlog_format=statement?
>
>
This whole check should go in the *_row.test version. I'll move it.


>
> > diff --git a/sql/log.cc b/sql/log.cc
> > index 1cc0fa77f08..63d9c756911 100644
> > --- a/sql/log.cc
> > +++ b/sql/log.cc
> > @@ -1898,10 +1898,20 @@ static inline int
> >  binlog_commit_flush_xid_caches(THD *thd, binlog_cache_mngr *cache_mngr,
> >                                 bool all, my_xid xid)
>
> >    Xid_log_event end_evt(thd, xid, TRUE);
> > -  return (binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE,
> TRUE));
> > +  if (!thd->slave_thread && ! thd->user_time.val)
> > +  {
> > +    /*
> > +      Ensure that on the master the event time is the time of commit,
> > +      not the start of statement time.
> > +    */
> > +    my_hrtime_t hrtime= my_hrtime();
> > +    end_evt.when= hrtime_to_my_time(hrtime);
> > +  }
> > +  DBUG_RETURN(binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE,
> TRUE));
>
> Ok, so correct me if I'm wrong, but I think the purpose here is: You want
> to
> have the time when the transaction ended (in case of long-running query).
> But for auto-committed query, there is no explicit COMMIT query, so the
> thd->start_time is the start of the transaction. So here you ensure that
> the
> XID_EVENT will have the time at start of commit in either case. If this is
> correct, probably could use a comment explaining this.
>

Right, will do.


>
> Some comments:
>
> 1. It is better to check for thd->rgi_slave being non-NULL than for
> thd->slave_thread. The thd->rgi_slave is associated with the replication of
> some event or transaction. The thd->slave_thread just identifies the
> thread/connection. (This is mostly style/code clarity, as in current code
> the two values would track each other.)
>

Sure.


>
> 2. Why the condition on thd->user_time.val? What is the situation where we
> are not replicating an event, but user_time is set and we do not want to
> put
> the current time in the XID_EVENT?


When SET TIMESTAMP is set, the XID_EVENT should use the user value,
rather than the current time (as all other events of the transaction will
use the
SET TIMESTAMP value).


>
> 3. This handles an XID end event, which is for InnoDB transactions. But
> MyISAM/Aria use a different end event, a COMMIT query, and there are are
> few
> other cases as well. Don't you want consistently to have the end of the
> query as the time in all end events? Then this code should probably instead
> be in binlog_flush_cache() or perhaps write_transaction_to_binlog()?
>

Hmm, but I think that would break your point of (4) where there are
Query_log_events without end events. So if we tag their `when` to be at
binlog time, then when we calculate the end of the Query_log_event on
the slave, it would be when + exec_time, which could put us significantly
past the real end time of the event. So I think the XID and Query_log events
should be treated differently.


>
> 4. Even given (3), there are still cases (eg. DDL) where there is no end
> event, and thus there is no event containing the time for the end of the
> operation on the master. Won't this cause some inconsistency in how the
> slave determines and displays its timings with this patch? If not, why not?
>

And to elaborate on the response of (3) for consistency, the
Query_log_event's `when` uses thd->start_time, even for the end event.
So the end Query_log_event for say a MyISAM/Aria transaction will be
consistent with the single Query_log_event of an event without an explicit
ending transaction.


>
>
> >    for (; has_xid && !entry.need_unlog && ha_info; ha_info=
> ha_info->next())
> >    {
> >      if (ha_info->is_started() && ha_info->ht() != binlog_hton &&
> >          !ha_info->ht()->commit_checkpoint_request)
> > +    {
> >        entry.need_unlog= true;
> > +      break;
> > +    }
> >    }
>
> Agree, this code always irritated me ;-)
>
> You could also move the condition (!entry.need_unlog && ha_info) out of the
> loop in an outer if () to not have to check it on every loop iteration
> (just
> an observation, entirely up to you, it's not something your patch
> introduced).
>
>
> > diff --git a/sql/log_event.cc b/sql/log_event.cc
> > index 3802f76efe0..706a4903eb0 100644
> > --- a/sql/log_event.cc
> > +++ b/sql/log_event.cc
>
> > @@ -1104,10 +1116,12 @@ Log_event* Log_event::read_log_event(const uchar
> *buf, uint event_len,
> >      switch(event_type) {
> >      case QUERY_EVENT:
> >        ev= new Query_log_event(buf, event_len, fdle, QUERY_EVENT);
> > +      set_orig_exec_time_in_thd(((Query_log_event*) ev)->exec_time);
> >        break;
> >      case QUERY_COMPRESSED_EVENT:
> >        ev= new Query_compressed_log_event(buf, event_len, fdle,
> > -                                          QUERY_COMPRESSED_EVENT);
> > +                                         QUERY_COMPRESSED_EVENT);
> > +      set_orig_exec_time_in_thd(((Query_compressed_log_event*)
> ev)->exec_time);
>
> > @@ -964,6 +964,18 @@ Log_event* Log_event::read_log_event(IO_CACHE* file,
> >    DBUG_RETURN(res);
> >  }
> >
> > +/*
> > +  Update thd->orig_exec_time
> > +*/
> > +
> > +inline void set_orig_exec_time_in_thd(my_time_t exec_time)
> > +{
> > +#if !defined(MYSQL_CLIENT) && defined(HAVE_REPLICATION)
> > +  THD *thd= current_thd;
> > +  if (likely(thd))
> > +    thd->orig_exec_time= exec_time;
> > +#endif
> > +}
>
> Using current_thd doesn't seem right here, and in fact it's not, as seen by
> this part of the patch where thd->orig_exec_time has to be set again for
> parallel replication:
>
> > diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> > index 51ca3b183ec..d279761767a 100644
> > --- a/sql/rpl_parallel.cc
> > +++ b/sql/rpl_parallel.cc
> > @@ -60,7 +60,10 @@ rpt_handle_event(rpl_parallel_thread::queued_event
> *qev,
> >    rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos;
> >    strcpy(rgi->future_event_master_log_name,
> qev->future_event_master_log_name);
> >    if (event_can_update_last_master_timestamp(ev))
> > +  {
> >      rgi->last_master_timestamp= ev->when + ev->exec_time;
> > +    thd->orig_exec_time= ev->exec_time;
> > +  }
>
> So let's not set thd->orig_exec_time from Log_event::read_log_event(), it's
> redundant (for parallel replication), and logically setting it should be
> done when applying the event, not when reading the event.
>
> Instead, set the thd->orig_exec_time in exec_relay_log_event(), so we
> retain
> similarity with how replication is done in the parallel and non-parallel
> case. There is already a similar if
> (event_can_update_last_master_timestamp(ev))
> in exec_relay_log_event().
>
> (Also see comments below about moving this from THD to rpl_group_info to
> not
> increase THD just for this very specific case).
>

Sure, makes sense (to also use RGI).


>
>
> > +
> > +/*
> > +  Get the time when the event had been executed on the master.
> > +  This works for both query events and load data events.
> > +*/
> > +
> > +#if Q_EXEC_TIME_OFFSET != L_EXEC_TIME_OFFSET
> > +#error "Q_EXEC_TIME_OFFSET is not same as L_EXEC_TIME_OFFSET"
> > +#endif
> > +
> > +time_t query_event_get_time(const uchar *buf,
> > +                            const Format_description_log_event
> > +                            *description_event)
> > +{
> > +  time_t when= uint4korr(buf);
> > +  buf+= description_event->common_header_len;
> > +  return when + uint4korr(buf + Q_EXEC_TIME_OFFSET);
>
> Suggestion: Add an assert here:
>
>   DBUG_ASSERT(LOG_EVENT_IS_QUERY(buf[EVENT_TYPE_OFFSET]) ||
>               LOG_EVENT_IS_LOAD_DATA(buf[EVENT_TYPE_OFFSET]));
>
> just to avoid someone mistakenly adding code later to call this for a wrong
> event type.
>

Sure, good idea.


>
>
> > diff --git a/sql/sql_class.h b/sql/sql_class.h
> > index 2d671139a6c..05c9c4e8012 100644
> > --- a/sql/sql_class.h
> > +++ b/sql/sql_class.h
> > @@ -3168,6 +3168,7 @@ class THD: public THD_count, /* this must be first
> */
> >    ulonglong  start_utime, utime_after_lock, utime_after_query;
> >    /* This can be used by handlers to send signals to the SQL level */
> >    ulonglong  replication_flags;
> > +  my_time_t orig_exec_time;         // Exec time for last read binlog
> event
>
> > @@ -1348,8 +1346,14 @@ Query_log_event::Query_log_event(THD* thd_arg,
> const char* query_arg,
> >    memset(&host, 0, sizeof(host));
> >    error_code= errcode;
> >
> > -  end_time= my_time(0);
> > -  exec_time = (ulong) (end_time  - thd_arg->start_time);
> > +  /*
> > +    For slave threads, remember the original master exec time.
> > +    This is needed to be able to calculate the master commit time.
> > +  */
> > +  exec_time= ((thd->slave_thread) ?
> > +              thd->orig_exec_time :
> > +              (my_time(0) - thd_arg->start_time));
> > +
>
> I think we can here avoid increasing the THD size, which is sad for a
> special case like this. Instead put the orig_exec_time into THD::rgi_slave,
> extending the struct rpl_group_info.
>
> Then the assignment above will be to rgi->orig_exec_time instead of
> thd->orig_exec_time, and the check here will be for thd->rgi_slave instead
> of thd->slave_thread. The rgi is the appropriate place for orig_exec_time,
> as it is a property of the transaction/event group being replicated.
>
>
> @@ -3171,146 +2856,170 @@ static const LEX_CSTRING msg_ignored=  {
> STRING_WITH_LEN("Ignored") };
>  #endif
>
>
> > -static bool send_show_master_info_data(THD *thd, Master_info *mi, bool
> full,
> > -                                       String *gtid_pos)
> > +inline void store_string_or_null(Field **field, const char *str)
> > +{
> >
> > -    mysql_mutex_unlock(&mi->run_lock);
> > +void store_master_info(THD *thd, Master_info *mi, TABLE *table,
> >
> > -    protocol->store_string_or_null(mi->host, &my_charset_bin);
> > -    protocol->store_string_or_null(mi->user, &my_charset_bin);
> > -    protocol->store((uint32) mi->port);
> > -    protocol->store((uint32) mi->connect_retry);
> > -    protocol->store(mi->master_log_name, strlen(mi->master_log_name),
> > +  store_string_or_null(field++, mi->host);
> > +  store_string_or_null(field++, mi->user);
> > +  (*field++)->store((uint32) mi->port);
> > +  (*field++)->store((uint32) mi->connect_retry);
> > +  (*field++)->store(mi->master_log_name, strlen(mi->master_log_name),
>
> So this patch refactors the code for SHOW SLAVE STATUS to use a shared code
> path with information_schema tables? That sounds like a good change.
> But for the future, please do such refactoring as a separate patch, you can
> see my patch series of pre-computing binlog checksums (MDEV-31273) for an
> example. Doing it together makes it very hard to review the actual code
> changes and see what is refactoring and what is new logic for
> master_slave_time_diff.
>
>
> > +      if (unlikely(!rli->slave_timestamp) &&
> Log_event::is_group_event(typ))
> > +      {
> > +        /*
> > +          First event for this slave. Assume that all the slave was up
> to date
> > +          with the master just before the current event.
> > +        */
> > +        rli->slave_timestamp= (time_t) ev->when + (time_t)
> ev->exec_time-1;
>
> I think the comment is meant to explain why there is the "-1" / minus-one
> in
> this expression, but it's not really clear exactly what is meant, so
> perhaps
> elaborate it a bit.
>
> Does it mean that master_slave_time_diff will start out as "1 second" just
> after the slave restarts?
>
> One way or the other, I do not have a strong opinion. But I want to give a
> general remark, from my experience with the troubles over the years with
> Seconds_Behind_Master. The problem with Seconds_Behind_Master was always
> that it didn't quite do what someone expected in some particular case, and
> then a new special case was added to try to accomodate. And after a number
> of iterations of this, we ended up with a semantics for
> Seconds_Behind_Master that no-one _really_ understands, and which still
> doesn't quite always do what people expect (which is in any case impossible
> if different people have different expectations).
>
> This "-1" here is clearly on the same road towards adding special cases. So
> my strong advice would be that, going forward, be _extremely_ conservative
> about modifying the semantics of master_slave_time_diff. The one good thing
> about the new status is that it has a (hopefully) clear semantics that can
> be documented (now including this "-1") and completely understood by users.
> Let's stick to that no matter who comes and wants just another small tweak,
> unless such tweaks actually make the semantics and documentation _simpler_.
>
>
> > +    /*
> > +      replay_log.description_event_for_exec can be null if the slave
> thread
> > +      is getting killed
> > +    */
> > +    if (LOG_EVENT_IS_QUERY((Log_event_type) buf[EVENT_TYPE_OFFSET]) ||
> > +        LOG_EVENT_IS_LOAD_DATA((Log_event_type) buf[EVENT_TYPE_OFFSET]))
> > +    {
> > +      time_t exec_time= query_event_get_time(buf, rli->relay_log.
> > +
>  description_event_for_queue);
> > +      set_if_bigger(rli->newest_master_timestamp, exec_time);
> > +    }
>
> This confused me, because in Query_log_event, the exec_time is the duration
> of the query (a time interval), not an absolute time.
>
> So please use the name "end_time" for the variable instead of "exec_time",
> and consider renaming query_event_get_time() to query_event_get_end_time().
>

Right, makes sense.


>
> >
> > +static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE
> *schema_table)
> > +{
> > +  ST_FIELD_INFO *field_info= schema_table->fields_info;
> > +  Name_resolution_context *context=
> &thd->lex->first_select_lex()->context;
> > +  DBUG_ASSERT(thd->lex->sql_command == SQLCOM_SHOW_SLAVE_STAT);
> > +  bool all_slaves= thd->lex->mi.show_all_slaves;
> > +  ulonglong used_fields= ~0ULL;
> > +
> > +  if (!all_slaves)
> > +  {
> > +    /* Remove 2 first fields and all fields above and including field
> 55 */
> > +    used_fields&= ~((1ULL << 0) | (1ULL << 1));
> > +    used_fields&= ((1ULL << 56)-1);
>
> Should there be some constant used for the magic numbers 55/56?
> And/or an assertion that this value is correct?
>

That'd make sense, sure.


>
> Also, is the intention here that we will from here on never add more fields
> to legacy SHOW SLAVE STATUS? If so, there should probably be a comment
> about
> this; also see the test case
> mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that
> might need updating.
>

Not that I know of, why wouldn't we want to keep them in-sync?


>
>  - Kristian
> _______________________________________________
> developers mailing list -- developers@lists.mariadb.org
> To unsubscribe send an email to developers-le...@lists.mariadb.org
>

_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org

[MariaDB developers] Re: [MariaDB commits] [PATCH] MDEV-33856 New definition for Seconds_Behind_Master

Reply via email to