[MariaDB developers] Re: [PATCH] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

Kristian Nielsen via developers Tue, 12 Mar 2024 04:29:49 -0700

> From: Brandon Nesterenko <brandon.nestere...@mariadb.com>
>
> When using semi-sync replication with
> rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the
> primary can significantly reduce compared to AFTER_SYNC's
> performance for workloads with many concurrent users executing
> transactions. This is because all connections on the primary share


Hi Brandon,

Here is my second round review of the MDEV-33551 patch.

I think these comments of mine from first round still need to be addressed:

> > A bit of care must be taken about lifetime of Tranx_node and THD
> > respectively (in case one goes away but not the other). But the whole

> > > +void Repl_semi_sync_master::await_all_slave_replies()

> > > +    wait_result= cond_timewait(front.thd, &abstime);
> > 
> > I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready?
> > Isn't there a possibility that this might go away during the wait?

I tried a simple testcase (attached) which disconnects the session thread
while await_all_slave_replies() is running during shutdown. As I expected, I
could see the COND_wakeup_ready of the session thread being de-allocated
while await_all_slave_replies() is still waking up from that condition. That
cannot be correct.

Similarly, it seems clearly possible that a session thread is killed and
exits while its entry is still present in the m_active_tranxs list, which
will similarly lead to signalling a de-allocated COND_wakeup_ready.

When a thread aborts the wait before the entry is removed, I think it should
simply look up its entry (can use the m_trx_htb hash table, or even just
iterate the list) and set its THD to NULL (or remove the entry, whatever is
simpler).

For the await_all_slave_replies() problem, I think the whole
wait-at-shutdown is much too complex as it is now. The whole
is_thd_awaiting_semisync_ack(), n_threads_awaiting_ack, special-case not
killing threads waiting for semi-sync-ack, and now introducing
unmark_thd_awaiting_ack. And it still has the bug waiting on a
COND_wakeup_ready of another THD that can disappear during the wait.

I think all of that should just go away, the wait in
await_all_slave_replies() should be done much simpler with a dedicated
condition variable to wait for. Eg. await_all_slave_replies() just does a
mysql_cond_wait(COND_binlog_send) under LOCK_binlog until the last binlog
position is ack'ed, and this condition is then signalled from
report_reply_binlog() (either always, or await_all_slave_replies() can set a
flag to request signal). Or have the shutdown thread enqueue its THD at the
end of the m_active_tranxs list. Let's avoid all this complex interaction
between SHUTDOWN WAIT FOR ALL SLAVES and session threads.

BTW, COND_binlog_send is currently unused in your patch.

Below detailed comments on individual parts of the patch:

> --- a/sql/mysqld.cc
> +++ b/sql/mysqld.cc
> @@ -1750,18 +1750,12 @@ static void close_connections(void)

> -  if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled())
> +  if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled() 
> &&
> +      repl_semisync_master.sync_get_master_wait_sessions())

Do you need to check sync_get_master_wait_sessions() > 0 here? Why not wait
unconditionally for the last binlog position sent?

> --- a/sql/semisync_master.cc
> +++ b/sql/semisync_master.cc
> @@ -68,6 +68,14 @@ static ulonglong timespec_to_usec(const struct timespec 
> *ts)
>    return (ulonglong) ts->tv_sec * TIME_MILLION + ts->tv_nsec / TIME_THOUSAND;
>  }
>  
> +int signal_waiting_transaction(THD *waiting_thd, const char *binlog_file,
> +                                my_off_t binlog_pos)
> +{
> +  if (waiting_thd->is_awaiting_semisync_ack)
> +    mysql_cond_signal(&waiting_thd->COND_wakeup_ready);

The check on is_awaiting_semi_sync_ack seems wrong here. Normally, the THD
will be waiting, otherwise it would not be in the list in the first place,
so faster with an occasional redundant signal than a conditional always. And
if the THD is not waiting, it's not safe to access it anyway.

Instead, as mentioned above, have a check for thd != NULL here, and let a
non-waiting thread clear the THD in its entry before it releases the
LOCK_binlog.

> -int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time)
> +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec 
> *_wait_time)

> +  if (_wait_time == NULL)
> +  {
> +    create_timeout(&gen_wait_time, NULL);
> +    real_wait_time= &gen_wait_time;
> +  }
> +  else
> +  {
> +    real_wait_time= _wait_time;
> +  }

Don't pass in booleans that are always hard-coded true or false in the
caller. Instead create two different functions, one that takes a time_spec,
and one that does not (the latter creating it and calling the former).

> @@ -533,7 +559,8 @@ void Repl_semi_sync_master::remove_slave()
>        Signal transactions waiting in commit_trx() that they do not have to
>        wait anymore.
>      */
> -    cond_broadcast();
> +    m_active_tranxs->clear_active_tranx_nodes(NULL, NULL,
> +                                              signal_waiting_transaction);

This does not compile on my machine (the second NULL is passed to my_off_t
position), is that a last-minute change that was never tested? Or NULL is
defined differently on your machine, so it doesn't fail to compile?

BTW, question about the old code: Why didn't the old code clear the
m_active_tranxs list? Was it a bug? Or was it cleared elsewhere before (and
if so where)?

> -void Repl_semi_sync_master::await_slave_reply()
> +void Repl_semi_sync_master::await_all_slave_replies()

> +    The lock is taken per iteration rather than over the whole loop to allow
> +    more opportunity for the user threads to handle their ACKs.

That sounds like a mis-understanding. The lock is always released by
mysql_cond_timedwait() in each iteration of the loop, no need to
release/take it again for just a couple instructions.

> +  while (TRUE)
> +  {
> +    lock();
> +    if (m_active_tranxs->is_empty() || !get_master_enabled() || !is_on())
> +    {
> +      unlock();
> +      break;
> +    }
> +    wait_result= m_active_tranxs->for_front(
> +        [](THD *thd, const char *binlog_file, my_off_t binlog_pos) -> int {
> +          return (thd->is_awaiting_semisync_ack)
> +                     ? repl_semisync_master.cond_timewait(thd, NULL)
> +                     : 0;
> +        });

As I wrote at the top, it is not safe to wait on the COND_wakeup_ready of
another THD, since that THD might go away during the wait. Instead, let's do
a specific wait for the end position here, either on COND_binlog or by
enqueing an extra THD at the end.

(And btw, in your code, why wait for the _front_ of the list and not the
_rear_?)

> diff --git a/sql/semisync_master.h b/sql/semisync_master.h
> index 99f46869354..1eb690b572b 100644
> --- a/sql/semisync_master.h
> +++ b/sql/semisync_master.h
> 
> +   * The pre_delete_hook parameter is a function pointer that will be invoked
> +   * for each Active_tranx node, in order, from m_trx_front to log_file_name,
> +   * e.g. to signal their wakeup condition. 
> Repl_semi_sync_binlog::LOCK_binlog
> +   * is held while this is invoked.

"from m_trx_front to log_file_name", I don't understand?
(Typo s/log_file_name/m_trx_rear/ ?)

> +/*
> +  Element in Repl_semi_sync_master::wait_queue to preserve the state of a
> +  transaction waiting for an ACK.
> +*/
> +typedef struct _semisync_wait_trx {
> +  const char *binlog_name;
> +  my_off_t binlog_pos;
> +  THD *thd;
> +} semisync_wait_trx_t;
> +

This is not longer used (should be removed).

>     * Input:  (the transaction events' ending binlog position)
>     *  trx_wait_binlog_name -     (IN)  ending position's file name
>     *  trx_wait_binlog_pos  -     (IN)  ending position's file offset
> +   *  unmark_thd_awaiting_ack  - (IN)  Whether or not to unmark
> +   *                             thd->is_awaiting_semisync_ack after 
> receiving
> +   *                             an ACK from the replica. If using wait_point
> +   *                             AFTER_SYNC with binlog_group_commit, only 
> the
> +   *                             last transaction written to binlog in the
> +   *                             group should negate the boolean, because the
> +   *                             same thread (i.e. leader thread) will wait 
> for
> +   *                             all transaction ACKs. Negating it too early
> +   *                             would break SHUTDOWN WAIT FOR ALL SLAVES, as
> +   *                             that is the condition tested to bypass 
> killing
> +   *                             threads in phase 1. In all other situations,
> +   *                             the boolean should be negated immediately.

That seems unnecessarily convoluted.
I think the real problem here is that write_tranx_in_binlog() sets the
is_awaiting_semi_sync_ack flag on the wrong THD, on the waiter_thd instead
of the trx_thd.

But once we fix the await_all_slave_replies(), do we even need the
is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point
becomes moot anyway.

 - Kristian.

--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_binlog_format_mixed.inc
--source include/master-slave.inc

--connection master
CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1,0);
--sync_slave_with_master

source include/stop_slave.inc;
SET @old_semi_enabled= @@GLOBAL.rpl_semi_sync_slave_enabled;
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
--source include/start_slave.inc

--connection master
SET GLOBAL rpl_semi_sync_master_enabled= 1;
SET GLOBAL rpl_semi_sync_master_timeout= 5000;

let $status_var= Rpl_semi_sync_master_clients;
let $status_var_value= 1;
source include/wait_for_status_var.inc;

show status like 'Rpl_semi_sync_master_status';
show status like 'Rpl_semi_sync_master_clients';


--connection slave
SET @old_dbug= @@GLOBAL.debug_dbug;
SET GLOBAL debug_dbug="+d,simulate_delay_semisync_slave_reply";

--connect(con1, localhost, root,,)
--connect(con2, localhost, root,,)

--connection con1
send INSERT INTO t1 VALUES (2,0);
--sleep 0.3
--disconnect con1

--connection default
--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
wait
EOF
#let $status_var= Rpl_semi_sync_master_wait_sessions;
#let $status_var_value= 1;
#source include/wait_for_status_var.inc;
--sleep 0.1

--connection con2
SHUTDOWN WAIT FOR ALL SLAVES;
--disconnect con2

--connection default
--source include/wait_until_disconnected.inc

--connection master
--source include/wait_until_disconnected.inc

--connection server_1
--source include/wait_until_disconnected.inc

--connection default
--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
restart
EOF
--enable_reconnect
--source include/wait_until_connected_again.inc

--connection master
--enable_reconnect
--source include/wait_until_connected_again.inc

--connection server_1
--enable_reconnect
--source include/wait_until_connected_again.inc

--save_master_pos
--connection slave
SET GLOBAL debug_dbug= @old_dbug;
SET GLOBAL rpl_semi_sync_slave_enabled= @old_semi_enabled;
--source include/stop_slave.inc
--source include/start_slave.inc
--sync_with_master

--connection default
DROP TABLE t1;
--source include/rpl_end.inc

_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org

[MariaDB developers] Re: [PATCH] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

Reply via email to