Howdy Kristian.
> Before doing mark_start_commit(), check that there is no pending deadlock > kill. If there is a pending kill, we won't commit (we will abort, roll back, > and retry). Then we should not mark the commit as started, since that could > potentially make the following GCO start too early, before we completed the > commit after the retry. > > This condition could trigger in some corner cases, where InnoDB would take > temporarily table/row locks that are released again immediately, not held > until the transaction commits. This happens with dict_stats updates and > possibly auto-increment locks. > > Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock > kill to be scheduled in the background. But since the blocking locks are > held only temporarily, they can be released before the background kill > happens. This way, the kill can be delayed until after mark_start_commit() > has been called. Thus we need to check the synchronous indication > rgi->killed_for_retry, not just the asynchroneous thd->killed. I think I understood what's going on, also thanks to a verbose Jira ticket description. > > Signed-off-by: Kristian Nielsen <kniel...@knielsen-hq.org> > --- > sql/rpl_parallel.cc | 20 ++++++++++++++++---- > sql/rpl_rli.cc | 17 +++++++++++++++++ > 2 files changed, 33 insertions(+), 4 deletions(-) > > diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc > index 1cfdf96ee3b..9c4222d7817 100644 > --- a/sql/rpl_parallel.cc > +++ b/sql/rpl_parallel.cc > @@ -1450,11 +1450,23 @@ handle_rpl_parallel_thread(void *arg) > after mark_start_commit(), we have to unmark, which has at least a > theoretical possibility of leaving a window where it looks like all > transactions in a GCO have started committing, while in fact one > - will need to rollback and retry. This is not supposed to be > possible > - (since there is a deadlock, at least one transaction should be > - blocked from reaching commit), but this seems a fragile ensurance, > - and there were historically a number of subtle bugs in this area. > + will need to rollback and retry. > + > + Normally this will not happen, since the kill is there to resolve a > + deadlock that is preventing at least one transaction from > proceeding. > + One case it can happen is with InnoDB dict stats update, which can > + temporarily cause transactions to block each other, but locks are > + released immediately, they don't linger until commit. There could > be > + other similar cases, there were historically a number of subtle > bugs > + in this area. > + > + But once we start the commit, we can expect that no new lock > + conflicts will be introduced. So by handling any lingering deadlock > + kill at this point just before mark_start_commit(), we should be > + robust even towards spurious deadlock kills. > */ > + if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE) > + wait_for_pending_deadlock_kill(thd, rgi); Assuming my understanding, please correct me if anything, I left a question on this block in https://github.com/MariaDB/server/commit/df0c36a354ffde2766ebed2615642c74b79170ad#r145664723 quoted further. ... this guard would not let a victim of an optimistic conflict proceed until it clears out itself it is such a victim. For instance in the binlog sequence like T_1, T_2, D_3 where T_2 is the victim, D_3 is a DDL, the victim T_2 won't anymore release the next gco member D_3 into its execution. But what if at this point T_2 is only going to be marked in rgi->killed_for_retry by T_1? That is it is apparently able, with this patch as well, having ...NONE killed status go through. T_1 would reach this very point later and when that is done before T_2 finds itself killed, T_1 would release D_3, and afterward T_2 would finally see itself KILLed into retry. Cheers, Andrei _______________________________________________ commits mailing list -- commits@lists.mariadb.org To unsubscribe send an email to commits-le...@lists.mariadb.org