On Mon, May 14, 2018 at 01:14:22PM +0530, Pavan Deolasee wrote: > Looks like I didn't understand Alvaro's comment when he mentioned it to me > off-list. But I now see what Michael and Alvaro mean and that indeed seems > like a problem. I was thinking that the test for (ControlFile->state == > DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated > after the standby is promoted. While that's true for a DB_IN_PRODUCTION, the > RestartPoint may finish after we have written end-of-recovery record, but > before we're in production and thus the minRecoveryPoint may again be set.
Yeah, this has been something I considered as well first, but I was not confident enough that setting up minRecoveryPoint to InvalidXLogRecPtr was actually a safe thing for timeline switches. So I have spent a good portion of today testing and playing with it to be confident enough that this was right, and I have finished with the attached. The patch adds a new flag to XLogCtl which marks if the control file has been updated after the end-of-recovery record has been written, so as minRecoveryPoint does not get updated because of a restart point running in parallel. I have also reworked the test case you sent, removing the manuals sleeps and replacing them with correct wait points. There is also no point to wait after promotion as pg_ctl promote implies a wait. Another important thing is that you need to use wal_log_hints = off to see a crash, which is something that allows_streaming actually enables. Comments are welcome. -- Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index adbd6a2126..230409ced6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -701,6 +701,15 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* + * Track if end-of-recovery record has been written and that the + * control file has been updated, so as the minimum recovery LSN + * does not get updated anymore. This way, WAL is replayed up to + * the end if a crash happens before the first post-end-recovery + * checkpoint. + */ + bool endOfRecoveryDone; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -2707,12 +2716,25 @@ XLogGetReplicationSlotMinimumLSN(void) static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force) { + bool endOfRecoveryDone; + /* Quick check using our local copy of the variable */ if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint)) return; LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); + /* + * If the end of recovery record has already been written and that + * the control file has been accordingly updated, then minRecoveryPoint + * should not be updated anymore. + */ + SpinLockAcquire(&XLogCtl->info_lck); + endOfRecoveryDone = XLogCtl->endOfRecoveryDone; + SpinLockRelease(&XLogCtl->info_lck); + if (endOfRecoveryDone) + return; + /* update local copy */ minRecoveryPoint = ControlFile->minRecoveryPoint; minRecoveryPointTLI = ControlFile->minRecoveryPointTLI; @@ -7032,6 +7054,7 @@ StartupXLOG(void) XLogCtl->recoveryLastXTime = 0; XLogCtl->currentChunkStartTime = 0; XLogCtl->recoveryPause = false; + XLogCtl->endOfRecoveryDone = false; SpinLockRelease(&XLogCtl->info_lck); /* Also ensure XLogReceiptTime has a sane value */ @@ -9040,10 +9063,21 @@ CreateEndOfRecoveryRecord(void) /* * Update the control file so that crash recovery can follow the timeline * changes to this point. + * + * Set minRecoveryPoint to InvalidXLogRecPtr so that a crash will force + * redo recovery to the end of WAL. Otherwise a crash immediately after + * promotion may lead to recovery to an inconsistent point and in the worst + * case, recovery failing because of invalid page references. + * + * endOfRecoveryDone is updated to reflect the fact that minRecoveryPoint + * cannot be updated anymore from this point. */ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); + SpinLockAcquire(&XLogCtl->info_lck); + XLogCtl->endOfRecoveryDone = true; + SpinLockRelease(&XLogCtl->info_lck); ControlFile->time = (pg_time_t) time(NULL); - ControlFile->minRecoveryPoint = recptr; + ControlFile->minRecoveryPoint = InvalidXLogRecPtr; ControlFile->minRecoveryPointTLI = ThisTimeLineID; UpdateControlFile(); LWLockRelease(ControlFileLock); @@ -9137,6 +9171,7 @@ CreateRestartPoint(int flags) CheckPoint lastCheckPoint; XLogRecPtr PriorRedoPtr; TimestampTz xtime; + bool endOfRecoveryDone; /* * Acquire CheckpointLock to ensure only one restartpoint or checkpoint @@ -9244,6 +9279,9 @@ CreateRestartPoint(int flags) * we get here after the end-of-recovery checkpoint. */ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); + SpinLockAcquire(&XLogCtl->info_lck); + endOfRecoveryDone = XLogCtl->endOfRecoveryDone; + SpinLockRelease(&XLogCtl->info_lck); if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && ControlFile->checkPointCopy.redo < lastCheckPoint.redo) { @@ -9261,8 +9299,12 @@ CreateRestartPoint(int flags) * at a minimum. Note that for an ordinary restart of recovery there's * no value in having the minimum recovery point any earlier than this * anyway, because redo will begin just after the checkpoint record. + * It could be possible that minRecoveryPoint has already been updated + * when generating the end-of-recovery record, in which case we want + * WAL replay to happen up to the end. */ - if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr) + if (!endOfRecoveryDone && + ControlFile->minRecoveryPoint < lastCheckPointEndPtr) { ControlFile->minRecoveryPoint = lastCheckPointEndPtr; ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID; diff --git a/src/test/recovery/t/015_promotion.pl b/src/test/recovery/t/015_promotion.pl new file mode 100644 index 0000000000..2f932d5acb --- /dev/null +++ b/src/test/recovery/t/015_promotion.pl @@ -0,0 +1,83 @@ +# Test for promotion handling with WAL records generated post-promotion +# before the first checkpoint is generated. This test case checks for +# invalid page references at replay based on the minimum consistent +# recovery point defined. +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 1; + +# Initialize primary node +my $alpha = get_new_node('alpha'); +$alpha->init(allows_streaming => 1); +# Setting wal_log_hints to off is important to get invalid page +# references. +$alpha->append_conf("postgresql.conf", <<EOF); +wal_log_hints = off +EOF + +# Start the primary +$alpha->start; + +# setup/start a standby +$alpha->backup('bkp'); +my $bravo = get_new_node('bravo'); +$bravo->init_from_backup($alpha, 'bkp', has_streaming => 1); +$bravo->append_conf('postgresql.conf', <<EOF); +checkpoint_timeout=1h +checkpoint_completion_target=0.9 +EOF +$bravo->start; + +# Dummy table for the upcoming tests. +$alpha->safe_psql('postgres', 'create table test1 (a int)'); +$alpha->safe_psql('postgres', 'insert into test1 select generate_series(1, 10000)'); + +# take a checkpoint +$alpha->safe_psql('postgres', 'checkpoint'); + +# The following vacuum will set visibility map bits and create +# problematic WAL records. +$alpha->safe_psql('postgres', 'vacuum verbose test1'); +# Wait for last record to have been replayed on the standby. +$alpha->wait_for_catchup($bravo, 'replay', + $alpha->lsn('insert')); + +# Now force a checkpoint on the standby. This seems unnecessary but for "some" +# reason, the previous checkpoint on the primary does not reflect on the standby +# and without an explicit checkpoint, it may start redo recovery from a much +# older point, which includes even create table and initial page additions. +$bravo->safe_psql('postgres', 'checkpoint'); + +# Now just use a dummy table and run some operations to move minRecoveryPoint +# beyond the previous vacuum. +$alpha->safe_psql('postgres', 'create table test2 (a int, b text)'); +$alpha->safe_psql('postgres', 'insert into test2 select generate_series(1,10000), md5(random()::text)'); +$alpha->safe_psql('postgres', 'truncate test2'); + +# Do the promotion, which reinitializes minRecoveryPoint in the control +# file so as WAL is replayed up to the end. +$bravo->promote; + +# Truncate the table on the promoted standby, vacuum and extend it +# again to create new page references. The first post-recovery checkpoint +# has not happened yet. +$bravo->safe_psql('postgres', 'truncate test1'); +$bravo->safe_psql('postgres', 'vacuum verbose test1'); +$bravo->safe_psql('postgres', 'insert into test1 select generate_series(1,1000)'); + +# Now crash-stop the promoted standby and restart. This makes sure that +# replay does not see invalid page references because of an invalid +# minimum consistent recovery point. +$bravo->stop('immediate'); +$bravo->start; + +# Check state of the table after full crash recovery. All its data should +# be here. +my $psql_out; +$bravo->psql( + 'postgres', + "SELECT count(*) FROM test1", + stdout => \$psql_out); +is($psql_out, '1000', "Check that table state is correct");
signature.asc
Description: PGP signature