Hi Brandon, Andrei, Does one of you want to review this patch for 10.5?
It fixes a real (and serious) issue in parallel replication that is seen as sporadic failure of rpl.rpl_start_alter_4 in Buildbot. This is yet another occurrence of the problem where mark_start_commit() is done too early, this code unfortunately is fragile and has had a number of problems in the past. But I am quite confident about the patch, calling wait_for_pending_deadlock_kill() should be very safe, and should be correct here. The added assertion helps detect the problem more reliably and could also help catch any other remaining issues around mark_start_commit (though hopefully there are no more left now). - Kristian. Kristian Nielsen via commits <comm...@lists.mariadb.org> writes: > Before doing mark_start_commit(), check that there is no pending deadlock > kill. If there is a pending kill, we won't commit (we will abort, roll back, > and retry). Then we should not mark the commit as started, since that could > potentially make the following GCO start too early, before we completed the > commit after the retry. > > This condition could trigger in some corner cases, where InnoDB would take > temporarily table/row locks that are released again immediately, not held > until the transaction commits. This happens with dict_stats updates and > possibly auto-increment locks. > > Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock > kill to be scheduled in the background. But since the blocking locks are > held only temporarily, they can be released before the background kill > happens. This way, the kill can be delayed until after mark_start_commit() > has been called. Thus we need to check the synchronous indication > rgi->killed_for_retry, not just the asynchroneous thd->killed. > > Signed-off-by: Kristian Nielsen <kniel...@knielsen-hq.org> > --- > sql/rpl_parallel.cc | 20 ++++++++++++++++---- > sql/rpl_rli.cc | 17 +++++++++++++++++ > 2 files changed, 33 insertions(+), 4 deletions(-) > > diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc > index 1cfdf96ee3b..9c4222d7817 100644 > --- a/sql/rpl_parallel.cc > +++ b/sql/rpl_parallel.cc > @@ -1450,11 +1450,23 @@ handle_rpl_parallel_thread(void *arg) > after mark_start_commit(), we have to unmark, which has at least a > theoretical possibility of leaving a window where it looks like all > transactions in a GCO have started committing, while in fact one > - will need to rollback and retry. This is not supposed to be > possible > - (since there is a deadlock, at least one transaction should be > - blocked from reaching commit), but this seems a fragile ensurance, > - and there were historically a number of subtle bugs in this area. > + will need to rollback and retry. > + > + Normally this will not happen, since the kill is there to resolve a > + deadlock that is preventing at least one transaction from > proceeding. > + One case it can happen is with InnoDB dict stats update, which can > + temporarily cause transactions to block each other, but locks are > + released immediately, they don't linger until commit. There could > be > + other similar cases, there were historically a number of subtle > bugs > + in this area. > + > + But once we start the commit, we can expect that no new lock > + conflicts will be introduced. So by handling any lingering deadlock > + kill at this point just before mark_start_commit(), we should be > + robust even towards spurious deadlock kills. > */ > + if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE) > + wait_for_pending_deadlock_kill(thd, rgi); > if (!thd->killed) > { > DEBUG_SYNC(thd, "rpl_parallel_before_mark_start_commit"); > diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc > index 9d4e09a362c..8284b07f7ff 100644 > --- a/sql/rpl_rli.cc > +++ b/sql/rpl_rli.cc > @@ -2518,6 +2518,23 @@ rpl_group_info::unmark_start_commit() > > e= this->parallel_entry; > mysql_mutex_lock(&e->LOCK_parallel_entry); > + /* > + Assert that we have not already wrongly completed this GCO and signalled > + the next one to start, only to now unmark and make the signal invalid. > + This is to catch problems like MDEV-34696. > + > + The error inject rpl_parallel_simulate_temp_err_xid is used to test this > + precise situation, that we handle it gracefully if it somehow occurs in a > + release build. So disable the assert in this case. > + */ > +#ifndef DBUG_OFF > + bool allow_unmark_after_complete= false; > + DBUG_EXECUTE_IF("rpl_parallel_simulate_temp_err_xid", > + allow_unmark_after_complete= true;); > + DBUG_ASSERT(!gco->next_gco || > + gco->next_gco->wait_count > e->count_committing_event_groups || > + allow_unmark_after_complete); > +#endif > --e->count_committing_event_groups; > mysql_mutex_unlock(&e->LOCK_parallel_entry); > } _______________________________________________ developers mailing list -- developers@lists.mariadb.org To unsubscribe send an email to developers-le...@lists.mariadb.org