Re: Timeline switching with partial WAL records can break replica recovery

Alyona Vinter Tue, 02 Sep 2025 22:53:52 -0700

Hi hackers,

I've found an error in my previous patch and have attached a fixed version.


I'd also like to clarify the timeline switching bug scenario that this
patch fixes:
The issue occurs in this cluster configuration:
[ master ] → [ cascade replica ] → [ replica ]

When the master is lost and the cascade replica is promoted (as described
above), the downstream replica may enter an infinite loop during recovery
instead of properly following the new timeline.
--
Regards,
Alyona Vinter

>

From 6dd413c17424d91ccb3f9df8a95d37ef905c555a Mon Sep 17 00:00:00 2001
From: Alena Vinter <[email protected]>
Date: Sun, 15 Jun 2025 01:29:23 +0700
Subject: [PATCH 2/2] [PGPRO-11992] [FIX] Removed assertion in walsummarizer

Fixes an edge case in the walsummarizer process where fetching InsertTLI
during a timeline switch could lead to latest_lsn < read_upto (which
becomes walrcv->flushedUpto in this case), previously triggering an
assertion failure. We now handle this safely by replacing the assertion
with conditional logic.
---
 src/backend/postmaster/walsummarizer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 11857356ce4..690a3b17e97 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1551,8 +1551,8 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 				if (private_data->tli == latest_tli)
 				{
 					/* Still the current timeline, update max LSN. */
-					Assert(latest_lsn >= private_data->read_upto);
-					private_data->read_upto = latest_lsn;
+					if (latest_lsn >= private_data->read_upto)
+						private_data->read_upto = latest_lsn;
 				}
 				else
 				{
--
2.51.0

From c947c47c95c625520f2eef2183e1f42594bf0c83 Mon Sep 17 00:00:00 2001
From: Alena Vinter <[email protected]>
Date: Mon, 26 May 2025 13:18:35 +0700
Subject: [PATCH 1/2] [PGPRO-11992] [FIX] Handle WAL timeline switches with
 incomplete records

When switching timelines with incomplete WAL records at the end of the
old timeline, physical replicas could enter an infinite recovery loop by
repeatedly requesting the same WAL data. This occurs because the
incomplete record isn't properly marked with XLOG_OVERWRITE_CONTRECORD
record, causing replicas to retry fetching it.

To fix this, we preserve WAL's append-only nature by writing an
XLOG_OVERWRITE_CONTRECORD to explicitly mark incomplete records before
initializing the new timeline. This ensures replicas can properly detect
transition to the new timeline without getting stuck.
---
 src/backend/access/transam/xlog.c         | 169 ++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  14 +-
 2 files changed, 93 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9e5daecd5e..eac83ade6e4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5898,6 +5898,7 @@ StartupXLOG(void)
 	EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI;
 	abortedRecPtr = endOfRecoveryInfo->abortedRecPtr;
 	missingContrecPtr = endOfRecoveryInfo->missingContrecPtr;
+	newTLI = endOfRecoveryInfo->lastRecTLI;

 	/*
 	 * Reset ps status display, so as no information related to recovery shows
@@ -5970,67 +5971,6 @@ StartupXLOG(void)
 	 */
 	SetInstallXLogFileSegmentActive();

-	/*
-	 * Consider whether we need to assign a new timeline ID.
-	 *
-	 * If we did archive recovery, we always assign a new ID.  This handles a
-	 * couple of issues.  If we stopped short of the end of WAL during
-	 * recovery, then we are clearly generating a new timeline and must assign
-	 * it a unique new ID.  Even if we ran to the end, modifying the current
-	 * last segment is problematic because it may result in trying to
-	 * overwrite an already-archived copy of that segment, and we encourage
-	 * DBAs to make their archive_commands reject that.  We can dodge the
-	 * problem by making the new active segment have a new timeline ID.
-	 *
-	 * In a normal crash recovery, we can just extend the timeline we were in.
-	 */
-	newTLI = endOfRecoveryInfo->lastRecTLI;
-	if (ArchiveRecoveryRequested)
-	{
-		newTLI = findNewestTimeLine(recoveryTargetTLI) + 1;
-		ereport(LOG,
-				(errmsg("selected new timeline ID: %u", newTLI)));
-
-		/*
-		 * Make a writable copy of the last WAL segment.  (Note that we also
-		 * have a copy of the last block of the old WAL in
-		 * endOfRecovery->lastPage; we will use that below.)
-		 */
-		XLogInitNewTimeline(EndOfLogTLI, EndOfLog, newTLI);
-
-		/*
-		 * Remove the signal files out of the way, so that we don't
-		 * accidentally re-enter archive recovery mode in a subsequent crash.
-		 */
-		if (endOfRecoveryInfo->standby_signal_file_found)
-			durable_unlink(STANDBY_SIGNAL_FILE, FATAL);
-
-		if (endOfRecoveryInfo->recovery_signal_file_found)
-			durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);
-
-		/*
-		 * Write the timeline history file, and have it archived. After this
-		 * point (or rather, as soon as the file is archived), the timeline
-		 * will appear as "taken" in the WAL archive and to any standby
-		 * servers.  If we crash before actually switching to the new
-		 * timeline, standby servers will nevertheless think that we switched
-		 * to the new timeline, and will try to connect to the new timeline.
-		 * To minimize the window for that, try to do as little as possible
-		 * between here and writing the end-of-recovery record.
-		 */
-		writeTimeLineHistory(newTLI, recoveryTargetTLI,
-							 EndOfLog, endOfRecoveryInfo->recoveryStopReason);
-
-		ereport(LOG,
-				(errmsg("archive recovery complete")));
-	}
-
-	/* Save the selected TimeLineID in shared memory, too */
-	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->InsertTimeLineID = newTLI;
-	XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
-	SpinLockRelease(&XLogCtl->info_lck);
-
 	/*
 	 * Actually, if WAL ended in an incomplete record, skip the parts that
 	 * made it through and start writing after the portion that persisted.
@@ -6039,15 +5979,9 @@ StartupXLOG(void)
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
-		/*
-		 * We should only have a missingContrecPtr if we're not switching to a
-		 * new timeline. When a timeline switch occurs, WAL is copied from the
-		 * old timeline to the new only up to the end of the last complete
-		 * record, so there can't be an incomplete WAL record that we need to
-		 * disregard.
-		 */
 		Assert(newTLI == endOfRecoveryInfo->lastRecTLI);
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
+		Assert(EndOfLog == abortedRecPtr);
 		EndOfLog = missingContrecPtr;
 	}

@@ -6104,6 +6038,95 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;

+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		Assert(EndOfLog == missingContrecPtr);
+
+		/*
+		 * Save the aborted record's TimeLineID in shared memory to ensure
+		 * correct writing of the overwrite-contrecord.
+		 *
+		 * These values will be incremented if timeline switch occurs.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		XLogCtl->InsertTimeLineID = newTLI;
+		XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		EndOfLog = CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
+		/*
+		 * Ensure the next records will be written to the next timeline segment
+		 * by closing the current segment.
+		 */
+		if (openLogFile >= 0)
+			XLogFileClose();
+	}
+
+	/*
+	 * Consider whether we need to assign a new timeline ID.
+	 *
+	 * If we did archive recovery, we always assign a new ID.  This handles a
+	 * couple of issues.  If we stopped short of the end of WAL during
+	 * recovery, then we are clearly generating a new timeline and must assign
+	 * it a unique new ID.  Even if we ran to the end, modifying the current
+	 * last segment is problematic because it may result in trying to
+	 * overwrite an already-archived copy of that segment, and we encourage
+	 * DBAs to make their archive_commands reject that.  We can dodge the
+	 * problem by making the new active segment have a new timeline ID.
+	 *
+	 * In a normal crash recovery, we can just extend the timeline we were in.
+	 */
+	if (ArchiveRecoveryRequested)
+	{
+		newTLI = findNewestTimeLine(recoveryTargetTLI) + 1;
+		ereport(LOG,
+				(errmsg("selected new timeline ID: %u", newTLI)));
+
+		/*
+		 * Make a writable copy of the last WAL segment.  (Note that we also
+		 * have a copy of the last block of the old WAL in
+		 * endOfRecovery->lastPage; we will use that below.)
+		 */
+		XLogInitNewTimeline(EndOfLogTLI, EndOfLog, newTLI);
+
+		/*
+		 * Remove the signal files out of the way, so that we don't
+		 * accidentally re-enter archive recovery mode in a subsequent crash.
+		 */
+		if (endOfRecoveryInfo->standby_signal_file_found)
+			durable_unlink(STANDBY_SIGNAL_FILE, FATAL);
+
+		if (endOfRecoveryInfo->recovery_signal_file_found)
+			durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby
+		 * servers.  If we crash before actually switching to the new
+		 * timeline, standby servers will nevertheless think that we switched
+		 * to the new timeline, and will try to connect to the new timeline.
+		 * To minimize the window for that, try to do as little as possible
+		 * between here and writing the end-of-recovery record.
+		 */
+		writeTimeLineHistory(newTLI, recoveryTargetTLI,
+							 EndOfLog, endOfRecoveryInfo->recoveryStopReason);
+
+		ereport(LOG,
+				(errmsg("archive recovery complete")));
+	}
+
+	/* Save the selected TimeLineID in shared memory, too */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->InsertTimeLineID = newTLI;
+	XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -6147,16 +6170,6 @@ StartupXLOG(void)
 	/* Shut down xlogreader */
 	ShutdownWalRecovery();

-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
-	}
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..26654fd7a66 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3167,19 +3167,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * of that record.  After recovery is done, we'll write a record
 			 * to indicate to downstream WAL readers that that portion is to
 			 * be ignored.
-			 *
-			 * However, when ArchiveRecoveryRequested = true, we're going to
-			 * switch to a new timeline at the end of recovery. We will only
-			 * copy WAL over to the new timeline up to the end of the last
-			 * complete record, so if we did this, we would later create an
-			 * overwrite contrecord in the wrong place, breaking everything.
 			 */
-			if (!ArchiveRecoveryRequested &&
-				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
-			{
-				abortedRecPtr = xlogreader->abortedRecPtr;
-				missingContrecPtr = xlogreader->missingContrecPtr;
-			}
+			abortedRecPtr = xlogreader->abortedRecPtr;
+			missingContrecPtr = xlogreader->missingContrecPtr;

 			if (readFile >= 0)
 			{
--
2.51.0

Re: Timeline switching with partial WAL records can break replica recovery

Reply via email to