[MariaDB developers] Re: [PATCH] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

Brandon Nesterenko via developers Thu, 14 Mar 2024 08:17:35 -0700

Thanks for the review, Kristian!

I agree with your summary, see my replies to your specific points below.


On Tue, Mar 12, 2024 at 5:29 AM Kristian Nielsen <kniel...@knielsen-hq.org>
wrote:

> > From: Brandon Nesterenko <brandon.nestere...@mariadb.com>
> >
> > When using semi-sync replication with
> > rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the
> > primary can significantly reduce compared to AFTER_SYNC's
> > performance for workloads with many concurrent users executing
> > transactions. This is because all connections on the primary share
>
> Hi Brandon,
>
> Here is my second round review of the MDEV-33551 patch.
>
> I think these comments of mine from first round still need to be addressed:
>
> > > A bit of care must be taken about lifetime of Tranx_node and THD
> > > respectively (in case one goes away but not the other). But the whole
>
> > > > +void Repl_semi_sync_master::await_all_slave_replies()
>
> > > > +    wait_result= cond_timewait(front.thd, &abstime);
> > >
> > > I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready?
> > > Isn't there a possibility that this might go away during the wait?
>
> I tried a simple testcase (attached) which disconnects the session thread
> while await_all_slave_replies() is running during shutdown. As I expected,
> I
> could see the COND_wakeup_ready of the session thread being de-allocated
> while await_all_slave_replies() is still waking up from that condition.
> That
> cannot be correct.
>
> Similarly, it seems clearly possible that a session thread is killed and
> exits while its entry is still present in the m_active_tranxs list, which
> will similarly lead to signalling a de-allocated COND_wakeup_ready.
>
> When a thread aborts the wait before the entry is removed, I think it
> should
> simply look up its entry (can use the m_trx_htb hash table, or even just
> iterate the list) and set its THD to NULL (or remove the entry, whatever is
> simpler).


> For the await_all_slave_replies() problem, I think the whole
> wait-at-shutdown is much too complex as it is now. The whole
> is_thd_awaiting_semisync_ack(), n_threads_awaiting_ack, special-case not
> killing threads waiting for semi-sync-ack, and now introducing
> unmark_thd_awaiting_ack. And it still has the bug waiting on a
> COND_wakeup_ready of another THD that can disappear during the wait.
>
> I think all of that should just go away, the wait in
> await_all_slave_replies() should be done much simpler with a dedicated
> condition variable to wait for. Eg. await_all_slave_replies() just does a
> mysql_cond_wait(COND_binlog_send) under LOCK_binlog until the last binlog
> position is ack'ed, and this condition is then signalled from
> report_reply_binlog() (either always, or await_all_slave_replies() can set
> a
> flag to request signal). Or have the shutdown thread enqueue its THD at the
> end of the m_active_tranxs list. Let's avoid all this complex interaction
> between SHUTDOWN WAIT FOR ALL SLAVES and session threads.
>
> BTW, COND_binlog_send is currently unused in your patch.
>
> Below detailed comments on individual parts of the patch:
>
> > --- a/sql/mysqld.cc
> > +++ b/sql/mysqld.cc
> > @@ -1750,18 +1750,12 @@ static void close_connections(void)
>
> > -  if (shutdown_wait_for_slaves &&
> repl_semisync_master.get_master_enabled())
> > +  if (shutdown_wait_for_slaves &&
> repl_semisync_master.get_master_enabled() &&
> > +      repl_semisync_master.sync_get_master_wait_sessions())
>
> Do you need to check sync_get_master_wait_sessions() > 0 here? Why not wait
> unconditionally for the last binlog position sent?
>

Ah this was an original simplification before I changed where
set_awaiting_semisync_ack
was set, but leaving it so is a bug, as we can have items in the wait list,
yet not be
waiting on the ACK. Even with the fix, I wonder if we should also move
where we increase
the wait_sessions variable to be when we enqueue into Active_tranx
(currently it is in
commit_trx() just before the wait). I suppose it depends on how we want to
define a wait
session. I.e., when we implement group ACK, how do we want to report the
leader actually
waiting for ACK, vs non-leaders waiting on the leader's ACK.


>
> > --- a/sql/semisync_master.cc
> > +++ b/sql/semisync_master.cc
> > @@ -68,6 +68,14 @@ static ulonglong timespec_to_usec(const struct
> timespec *ts)
> >    return (ulonglong) ts->tv_sec * TIME_MILLION + ts->tv_nsec /
> TIME_THOUSAND;
> >  }
> >
> > +int signal_waiting_transaction(THD *waiting_thd, const char
> *binlog_file,
> > +                                my_off_t binlog_pos)
> > +{
> > +  if (waiting_thd->is_awaiting_semisync_ack)
> > +    mysql_cond_signal(&waiting_thd->COND_wakeup_ready);
>
> The check on is_awaiting_semi_sync_ack seems wrong here. Normally, the THD
> will be waiting, otherwise it would not be in the list in the first place,
> so faster with an occasional redundant signal than a conditional always.
> And
> if the THD is not waiting, it's not safe to access it anyway.
>
> Instead, as mentioned above, have a check for thd != NULL here, and let a
> non-waiting thread clear the THD in its entry before it releases the
> LOCK_binlog.
>

Ah, that makes more sense. Will do.


>
> > -int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time)
> > +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec
> *_wait_time)
>
> > +  if (_wait_time == NULL)
> > +  {
> > +    create_timeout(&gen_wait_time, NULL);
> > +    real_wait_time= &gen_wait_time;
> > +  }
> > +  else
> > +  {
> > +    real_wait_time= _wait_time;
> > +  }
>
> Don't pass in booleans that are always hard-coded true or false in the
> caller. Instead create two different functions, one that takes a time_spec,
> and one that does not (the latter creating it and calling the former).
>

Sure.


>
> > @@ -533,7 +559,8 @@ void Repl_semi_sync_master::remove_slave()
> >        Signal transactions waiting in commit_trx() that they do not have
> to
> >        wait anymore.
> >      */
> > -    cond_broadcast();
> > +    m_active_tranxs->clear_active_tranx_nodes(NULL, NULL,
> > +
> signal_waiting_transaction);
>
> This does not compile on my machine (the second NULL is passed to my_off_t
> position), is that a last-minute change that was never tested? Or NULL is
> defined differently on your machine, so it doesn't fail to compile?
>
>
Ah, interesting. With clang it compiles, but with GCC it does not. I'll fix
it to `0`.


> BTW, question about the old code: Why didn't the old code clear the
> m_active_tranxs list? Was it a bug? Or was it cleared elsewhere before (and
> if so where)?
>

It did not clear it, but should have.


>
> > -void Repl_semi_sync_master::await_slave_reply()
> > +void Repl_semi_sync_master::await_all_slave_replies()
>
> > +    The lock is taken per iteration rather than over the whole loop to
> allow
> > +    more opportunity for the user threads to handle their ACKs.
>
> That sounds like a mis-understanding. The lock is always released by
> mysql_cond_timedwait() in each iteration of the loop, no need to
> release/take it again for just a couple instructions.


> > +  while (TRUE)
> > +  {
> > +    lock();
> > +    if (m_active_tranxs->is_empty() || !get_master_enabled() ||
> !is_on())
> > +    {
> > +      unlock();
> > +      break;
> > +    }
> > +    wait_result= m_active_tranxs->for_front(
> > +        [](THD *thd, const char *binlog_file, my_off_t binlog_pos) ->
> int {
> > +          return (thd->is_awaiting_semisync_ack)
> > +                     ? repl_semisync_master.cond_timewait(thd, NULL)
> > +                     : 0;
> > +        });
>
> As I wrote at the top, it is not safe to wait on the COND_wakeup_ready of
> another THD, since that THD might go away during the wait. Instead, let's
> do
> a specific wait for the end position here, either on COND_binlog or by
> enqueing an extra THD at the end.
>

Sure, I'll keep COND_binlog around and use that (the main thread doesn't
have
a THD, and creating one just to enqueue into the list would be more overhead
than keeping COND_binlog).


>
> (And btw, in your code, why wait for the _front_ of the list and not the
> _rear_?)
>

Ah, that was just keeping the original logic from pre-patch (to be somewhat
consistent). It would have made sense to optimize to just be the rear, with
the new
logic.


>
> > diff --git a/sql/semisync_master.h b/sql/semisync_master.h
> > index 99f46869354..1eb690b572b 100644
> > --- a/sql/semisync_master.h
> > +++ b/sql/semisync_master.h
> >
> > +   * The pre_delete_hook parameter is a function pointer that will be
> invoked
> > +   * for each Active_tranx node, in order, from m_trx_front to
> log_file_name,
> > +   * e.g. to signal their wakeup condition.
> Repl_semi_sync_binlog::LOCK_binlog
> > +   * is held while this is invoked.
>
> "from m_trx_front to log_file_name", I don't understand?
> (Typo s/log_file_name/m_trx_rear/ ?)
>
>
Huh. Big typo, my intelli-"sense" must have filled it in.


> > +/*
> > +  Element in Repl_semi_sync_master::wait_queue to preserve the state of
> a
> > +  transaction waiting for an ACK.
> > +*/
> > +typedef struct _semisync_wait_trx {
> > +  const char *binlog_name;
> > +  my_off_t binlog_pos;
> > +  THD *thd;
> > +} semisync_wait_trx_t;
> > +
>
> This is not longer used (should be removed).
>

Right, thanks.


>
> >     * Input:  (the transaction events' ending binlog position)
> >     *  trx_wait_binlog_name -     (IN)  ending position's file name
> >     *  trx_wait_binlog_pos  -     (IN)  ending position's file offset
> > +   *  unmark_thd_awaiting_ack  - (IN)  Whether or not to unmark
> > +   *                             thd->is_awaiting_semisync_ack after
> receiving
> > +   *                             an ACK from the replica. If using
> wait_point
> > +   *                             AFTER_SYNC with binlog_group_commit,
> only the
> > +   *                             last transaction written to binlog in
> the
> > +   *                             group should negate the boolean,
> because the
> > +   *                             same thread (i.e. leader thread) will
> wait for
> > +   *                             all transaction ACKs. Negating it too
> early
> > +   *                             would break SHUTDOWN WAIT FOR ALL
> SLAVES, as
> > +   *                             that is the condition tested to bypass
> killing
> > +   *                             threads in phase 1. In all other
> situations,
> > +   *                             the boolean should be negated
> immediately.
>
> That seems unnecessarily convoluted.
> I think the real problem here is that write_tranx_in_binlog() sets the
> is_awaiting_semi_sync_ack flag on the wrong THD, on the waiter_thd instead
> of the trx_thd.
>
> But once we fix the await_all_slave_replies(), do we even need the
> is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this
> point
> becomes moot anyway.
>

Ah, right, this is actually a bug as is, as when we are in
kill_threads_phase_1 we
test for is_awaiting_semi_sync_ack, which would only be set on the leader
binlog thread.. So then in kill_threads_phase_1, we'd just check if the THD
is in
Active_tranx to see if we should skip the kill.


>
>  - Kristian.
>
>
In summary
 1) Fix close_connections() check from wait_sessions to instead, always
wait for
     the last element in the queue
 2) Fix aborted threads to set their THD to null (with respective NULL thd
checks,
     where appropriate)
 3) Fix await_all_slave_replies() to use COND_binlog (signalled on last
slave
     reply)
 4) Remove THD::is_awaiting_semisync_ack, change kill_threads_phase_1 to
check
     if THD is in Active_tranx.
 5) Fix comment typo s/log_file_name/m_trx_rear/
 6) Remove struct _semisync_wait_trx
 7) Split cond_timewait into two functions, one with a time_spec param, and
one
     without
 8) Fix clear_active_tranx_nodes NULL param


Thanks again,
Brandon

_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org

[MariaDB developers] Re: [PATCH] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

Reply via email to