Hi Kristian, Thanks for the review!
I see your line of thinking; agreed that --skip-slave-errors=all with optimistic parallel replication remains problematic. Though I think we should open it as a different JIRA, as the use cases differ, i.e. MDEV-27512 reports on the serial replica (which is fixed by the current patch), and the issue you describe is for the parallel replica (and I likely wouldn't be able to see the fix for the parallel replica through in time for this release, as I'm on vacation April 21 - May 3). Though I'm happy to still take on this new issue. On Thu, Apr 18, 2024 at 2:42 AM Kristian Nielsen <kniel...@knielsen-hq.org> wrote: > Hi Brandon, > > I only now got to check on MDEV-27512. > > This bug seems to be about using --slave_skip_errors=ALL with optimistic > parallel replication. > > I saw your patch for the bug. While the patch by itself is perhaps ok, I > think it's only fixing a minor thing, and not addressing the real problem, > let me explain my thinking: > > Now, --slave-skip-errors is obviously a dangerous option, but presumably > the > use case is to make the slave continue at all costs in case an unexpected > error occurs, rather than stop the slave and break replication. > > But optimistic parallel replication by design can introduce many different > transient errors due to conflicts, that are then handled by rolling back > and > retrying the offending transactions. These errors are *expected*, and they > do *not* cause slave to stop or replication to break. > > However, it appears that --slave_skip_errors also affects such transient > errors due to optimistic parallel replication, and will cause any such > transaction to be silently ignored! This must be very wrong, it will cause > massive replication divergence. > > It seems to me this is the real bug here. When an error is encountered > during optimistic apply of a transaction (is_parallel_retry_error() returns > true, eg. rgi->speculation == SPECULATE_OPTIMISTIC, then this error should > _not_ be subject to --slave-skip-errors. The transaction should be rolled > back as normal, and wait_for_prior_commit() done. Then after setting > rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error, > --slave-skip-errors can apply. > > I put together a quick test that seems to show this behaviour, included > below. This tests replicates correctly without replication stopping with > error. But running it with --mysqld=--slave-skip-errors=all, it replicates > incorrectly, skipping lots of transactions. The test is somewhat > contrieved, > but I think it shows the real problem, that --slave-skip-errors can > randomly > cause transactions to be skipped or not depending on if optimistic parallel > replication triggers a matching transient error or not. > > So in summary, it looks like there is a real problem here, that optimistic > parallel replication is not working correctly with --slave-skip-errors, > transient errors incorrectly causes conflicts to skip transactions rather > than retrying them. This will cause replication to diverge even when no > real > errors occur. > > And the patch for MDEV-27512 doesn't seem to address this issue. > > So I think MDEV-27512 should be re-opened, what do you think? > > - Kristian. > > ----------------------------------------------------------------------- > --source include/have_innodb.inc > --source include/have_binlog_format_row.inc > --source include/master-slave.inc > > --connection master > ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB; > CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB; > INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6); > > --sync_slave_with_master > > --source include/stop_slave.inc > CHANGE MASTER TO master_use_gtid=slave_pos; > SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout; > SET GLOBAL innodb_lock_wait_timeout= 5; > SET @old_parallel= @@GLOBAL.slave_parallel_threads; > SET @old_mode= @@GLOBAL.slave_parallel_mode; > SET GLOBAL slave_parallel_mode= aggressive; > SET GLOBAL slave_parallel_threads= 20; > > --connection master > UPDATE t1 SET b=b+1 WHERE a=6; > > --disable_query_log > let $i= 0; > while ($i < 40) { > eval UPDATE t1 SET b=b+1 WHERE a=2; > inc $i; > } > --enable_query_log > > SELECT * FROM t1 ORDER BY a; > --save_master_pos > > --connection slave1 > # Block first worker, and recursively pause all following workers that get > # temporary errors before they can retry. > BEGIN; > SELECT * FROM t1 WHERE a=6 FOR UPDATE; > > --connection slave > # Cause initial row not found error. > SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2; > > --source include/start_slave.inc > > --sleep 2 > # Now following workers should be waiting for prior commit before retrying. > # Remove the row not found error. > SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7; > > --connection slave1 > ROLLBACK; > > --connection slave > --sync_with_master > > SELECT * FROM t1 ORDER BY a; > > # Cleanup > --connection slave > --source include/stop_slave.inc > SET GLOBAL innodb_lock_wait_timeout= @old_timeout; > SET GLOBAL slave_parallel_threads= @old_parallel; > SET GLOBAL slave_parallel_mode= @old_mode; > --source include/start_slave.inc > > --connection default > DROP TABLE t1; > > --source include/rpl_end.inc >
_______________________________________________ developers mailing list -- developers@lists.mariadb.org To unsubscribe send an email to developers-le...@lists.mariadb.org