Hi Kristian,

Thanks for the review!

I see your line of thinking; agreed that --skip-slave-errors=all with
optimistic parallel replication remains problematic. Though I think we
should open it as a different JIRA, as the use cases differ, i.e.
MDEV-27512 reports on the serial replica (which is fixed by the current
patch), and the issue you describe is for the parallel replica (and I
likely wouldn't be able to see the fix for the parallel replica through
in time for this release, as I'm on vacation April 21 - May 3). Though
I'm happy to still take on this new issue.

On Thu, Apr 18, 2024 at 2:42 AM Kristian Nielsen <kniel...@knielsen-hq.org>
wrote:

> Hi Brandon,
>
> I only now got to check on MDEV-27512.
>
> This bug seems to be about using --slave_skip_errors=ALL with optimistic
> parallel replication.
>
> I saw your patch for the bug. While the patch by itself is perhaps ok, I
> think it's only fixing a minor thing, and not addressing the real problem,
> let me explain my thinking:
>
> Now, --slave-skip-errors is obviously a dangerous option, but presumably
> the
> use case is to make the slave continue at all costs in case an unexpected
> error occurs, rather than stop the slave and break replication.
>
> But optimistic parallel replication by design can introduce many different
> transient errors due to conflicts, that are then handled by rolling back
> and
> retrying the offending transactions. These errors are *expected*, and they
> do *not* cause slave to stop or replication to break.
>
> However, it appears that --slave_skip_errors also affects such transient
> errors due to optimistic parallel replication, and will cause any such
> transaction to be silently ignored! This must be very wrong, it will cause
> massive replication divergence.
>
> It seems to me this is the real bug here. When an error is encountered
> during optimistic apply of a transaction (is_parallel_retry_error() returns
> true, eg. rgi->speculation == SPECULATE_OPTIMISTIC, then this error should
> _not_ be subject to --slave-skip-errors. The transaction should be rolled
> back as normal, and wait_for_prior_commit() done. Then after setting
> rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error,
> --slave-skip-errors can apply.
>
> I put together a quick test that seems to show this behaviour, included
> below. This tests replicates correctly without replication stopping with
> error. But running it with --mysqld=--slave-skip-errors=all, it replicates
> incorrectly, skipping lots of transactions. The test is somewhat
> contrieved,
> but I think it shows the real problem, that --slave-skip-errors can
> randomly
> cause transactions to be skipped or not depending on if optimistic parallel
> replication triggers a matching transient error or not.
>
> So in summary, it looks like there is a real problem here, that optimistic
> parallel replication is not working correctly with --slave-skip-errors,
> transient errors incorrectly causes conflicts to skip transactions rather
> than retrying them. This will cause replication to diverge even when no
> real
> errors occur.
>
> And the patch for MDEV-27512 doesn't seem to address this issue.
>
> So I think MDEV-27512 should be re-opened, what do you think?
>
>  - Kristian.
>
> -----------------------------------------------------------------------
> --source include/have_innodb.inc
> --source include/have_binlog_format_row.inc
> --source include/master-slave.inc
>
> --connection master
> ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
> CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
> INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
>
> --sync_slave_with_master
>
> --source include/stop_slave.inc
> CHANGE MASTER TO master_use_gtid=slave_pos;
> SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;
> SET GLOBAL innodb_lock_wait_timeout= 5;
> SET @old_parallel= @@GLOBAL.slave_parallel_threads;
> SET @old_mode= @@GLOBAL.slave_parallel_mode;
> SET GLOBAL slave_parallel_mode= aggressive;
> SET GLOBAL slave_parallel_threads= 20;
>
> --connection master
> UPDATE t1 SET b=b+1 WHERE a=6;
>
> --disable_query_log
> let $i= 0;
> while ($i < 40) {
>   eval UPDATE t1 SET b=b+1 WHERE a=2;
>   inc $i;
> }
> --enable_query_log
>
> SELECT * FROM t1 ORDER BY a;
> --save_master_pos
>
> --connection slave1
> # Block first worker, and recursively pause all following workers that get
> # temporary errors before they can retry.
> BEGIN;
> SELECT * FROM t1 WHERE a=6 FOR UPDATE;
>
> --connection slave
> # Cause initial row not found error.
> SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2;
>
> --source include/start_slave.inc
>
> --sleep 2
> # Now following workers should be waiting for prior commit before retrying.
> # Remove the row not found error.
> SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7;
>
> --connection slave1
> ROLLBACK;
>
> --connection slave
> --sync_with_master
>
> SELECT * FROM t1 ORDER BY a;
>
> # Cleanup
> --connection slave
> --source include/stop_slave.inc
> SET GLOBAL innodb_lock_wait_timeout= @old_timeout;
> SET GLOBAL slave_parallel_threads= @old_parallel;
> SET GLOBAL slave_parallel_mode= @old_mode;
> --source include/start_slave.inc
>
> --connection default
> DROP TABLE t1;
>
> --source include/rpl_end.inc
>
_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org

Reply via email to