There's a race condition between the checkpoint at promotion and pg_rewind. When a server is promoted, the startup process writes an end-of-recovery checkpoint that includes the new TLI, and the server is immediate opened for business. The startup process requests the checkpointer process to perform a checkpoint, but it can take a few seconds or more to complete. If you run pg_rewind, using the just promoted server as the source, pg_rewind will think that the server is still on the old timeline, because it only looks at TLI in the control file's copy of the checkpoint record. That's not updated until the checkpoint is finished.

This isn't a new issue. Stephen Frost first reported it back 2015 [1]. Back then, it was deemed just a small annoyance, and we just worked around it in the tests by issuing a checkpoint command after promotion, to wait for the checkpoint to finish. I just ran into it again today, with the new pg_rewind test, and silenced it in the similar way.

I think we should fix this properly. I'm not sure if it can lead to a broken cluster, but at least it can cause pg_rewind to fail unnecessarily and in a user-unfriendly way. But this is actually pretty simple to fix. pg_rewind looks at the control file to find out the timeline the server is on. When promotion happens, the startup process updates minRecoveryPoint and minRecoveryPointTLI fields in the control file. We just need to read it from there. Patch attached.

I think we should also backpatch this. Back in 2015, we decided that we can live with this, but it's always been a bit bogus, and seems simple enough to fix.

Thoughts?

[1] https://www.postgresql.org/message-id/20150428180253.GU30322%40tamriel.snowman.net

- Heikki
>From 274c59ed0e6573dd387a03e36eda15cb41658447 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Mon, 7 Dec 2020 19:59:50 +0200
Subject: [PATCH 1/1] pg_rewind: Fix determining TLI when server was just
 promoted.

If the source server was just promoted, and it hasn't written the
checkpoint record yet, pg_rewind considered the server to be still on
the old timeline. Because of that, it would claim incorrectly that no
rewind is required. Fix that by looking at the minRecoveryPointTLI in
the control file in addition to the ThisTimeLineID on the checkpoint.

Discussion: TODO
---
 src/bin/pg_rewind/pg_rewind.c                 | 55 +++++++++++--------
 src/bin/pg_rewind/t/007_standby_source.pl     |  1 -
 src/bin/pg_rewind/t/008_min_recovery_point.pl |  9 ---
 src/bin/pg_rewind/t/RewindTest.pm             |  8 ---
 4 files changed, 33 insertions(+), 40 deletions(-)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index d89c08f81d..8eb9b378dc 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -44,6 +44,7 @@ static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
+static TimeLineHistoryEntry *getTimelineHistory(TimeLineID tli, bool is_source, int *nentries);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
 static void ensureCleanShutdown(const char *argv0);
 static void disconnect_atexit(void);
@@ -67,9 +68,11 @@ bool		dry_run = false;
 bool		do_sync = true;
 bool		restore_wal = false;
 
-/* Target history */
+/* Source target timeline histories */
 TimeLineHistoryEntry *targetHistory;
 int			targetNentries;
+static TimeLineHistoryEntry *sourceHistory;
+static int	sourceNentries;
 
 /* Progress counters */
 uint64		fetch_size;
@@ -129,6 +132,8 @@ main(int argc, char **argv)
 	XLogRecPtr	chkptrec;
 	TimeLineID	chkpttli;
 	XLogRecPtr	chkptredo;
+	TimeLineID	source_tli;
+	TimeLineID	target_tli;
 	XLogRecPtr	target_wal_endrec;
 	size_t		size;
 	char	   *buffer;
@@ -325,14 +330,28 @@ main(int argc, char **argv)
 
 	sanityChecks();
 
+	/*
+	 * Usually, the TLI can be found in the latest checkpoint record. But if
+	 * the source server is just being promoted (or it's a standby that's
+	 * following a primary that's just being promoted), and the checkpoint
+	 * requested by the promotion hasn't completed yet, the latest timeline
+	 * is in minRecoveryPoint. So we check which is later, the TLI of the
+	 * minRecoveryPoint or the latest checkpoint.
+	 */
+	source_tli = Max(ControlFile_source.minRecoveryPointTLI,
+					 ControlFile_source.checkPointCopy.ThisTimeLineID);
+
+	/* Similarly for the target. */
+	target_tli = Max(ControlFile_target.minRecoveryPointTLI,
+					 ControlFile_target.checkPointCopy.ThisTimeLineID);
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (target_tli == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -342,6 +361,10 @@ main(int argc, char **argv)
 	{
 		XLogRecPtr	chkptendrec;
 
+		/* Retrieve timelines for both source and target */
+		sourceHistory = getTimelineHistory(source_tli, true, &sourceNentries);
+		targetHistory = getTimelineHistory(target_tli, false, &targetNentries);
+
 		findCommonAncestorTimeline(&divergerec, &lastcommontliIndex);
 		pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
 					(uint32) (divergerec >> 32), (uint32) divergerec,
@@ -651,7 +674,8 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+			endtli = Max(ControlFile_source_after.checkPointCopy.ThisTimeLineID,
+						 ControlFile_source_after.minRecoveryPointTLI);
 		}
 	}
 	else
@@ -802,12 +826,9 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
  * either source or target.
  */
 static TimeLineHistoryEntry *
-getTimelineHistory(ControlFileData *controlFile, int *nentries)
+getTimelineHistory(TimeLineID tli, bool is_source, int *nentries)
 {
 	TimeLineHistoryEntry *history;
-	TimeLineID	tli;
-
-	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
@@ -828,12 +849,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		TLHistoryFilePath(path, tli);
 
 		/* Get history file from appropriate source */
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			histfile = source->fetch_file(source, path, NULL);
-		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
 		else
-			pg_fatal("invalid control file");
+			histfile = slurpFile(datadir_target, path, NULL);
 
 		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
 		pg_free(histfile);
@@ -843,12 +862,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 	{
 		int			i;
 
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			pg_log_debug("Source timeline history:");
-		else if (controlFile == &ControlFile_target)
-			pg_log_debug("Target timeline history:");
 		else
-			Assert(false);
+			pg_log_debug("Target timeline history:");
 
 		/*
 		 * Print the target timeline history.
@@ -881,15 +898,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 static void
 findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 7a597bf12b..82b9fc39d7 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -80,7 +80,6 @@ $node_b->wait_for_catchup('node_c', 'write', $lsn);
 # A (primary) <--- B (standby)      C (primary)
 
 $node_c->promote;
-$node_c->safe_psql('postgres', "checkpoint");
 
 
 # Insert a row in A. This causes A/B and C to have "diverged", so that it's
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e3d94b3346..60d74a7e1a 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -75,13 +75,6 @@ $node_1->wait_for_catchup('node_3', 'replay', $lsn);
 #
 $node_1->stop('fast');
 $node_3->promote;
-# Force a checkpoint after the promotion. pg_rewind looks at the control
-# file to determine what timeline the server is on, and that isn't updated
-# immediately at promotion, but only at the next checkpoint. When running
-# pg_rewind in remote mode, it's possible that we complete the test steps
-# after promotion so quickly that when pg_rewind runs, the standby has not
-# performed a checkpoint after promotion yet.
-$node_3->safe_psql('postgres', "checkpoint");
 
 # reconfigure node_1 as a standby following node_3
 my $node_3_connstr = $node_3->connstr;
@@ -106,8 +99,6 @@ $lsn = $node_3->lsn('insert');
 $node_3->wait_for_catchup('node_1', 'replay', $lsn);
 
 $node_1->promote;
-# Force a checkpoint after promotion, like earlier.
-$node_1->safe_psql('postgres', "checkpoint");
 
 #
 # We now have a split-brain with two primaries. Insert a row on both to
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 41ed7d4b3b..e2041e6fc0 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -206,14 +206,6 @@ sub promote_standby
 	# the primary out-of-sync with the standby.
 	$node_standby->promote;
 
-	# Force a checkpoint after the promotion. pg_rewind looks at the control
-	# file to determine what timeline the server is on, and that isn't updated
-	# immediately at promotion, but only at the next checkpoint. When running
-	# pg_rewind in remote mode, it's possible that we complete the test steps
-	# after promotion so quickly that when pg_rewind runs, the standby has not
-	# performed a checkpoint after promotion yet.
-	standby_psql("checkpoint");
-
 	return;
 }
 
-- 
2.20.1

Reply via email to