*Dear Srinath,*

*Subject:* [PATCH] pg_rewind: Ignore shutdown checkpoints when determining
rewind necessity.

While working with pg_rewind, I noticed that it can sometimes request a
rewind even when no real changes exist after a failover. This happens
because pg_rewind currently determines the end-of-WAL on the target using
the last shutdown checkpoint (or minRecoveryPoint for a standby). In a
clean failover scenario—where a standby is promoted and the old primary is
later shut down—the only WAL record generated after divergence may be a
shutdown checkpoint. Although the data on both nodes is identical, pg_rewind
treats this shutdown record as meaningful and unnecessarily forces a
rewind. The proposed patch fixes this by ignoring shutdown checkpoints (
XLOG_CHECKPOINT_SHUTDOWN) when determining the end-of-WAL, scanning
backward until a non-shutdown record is found. This ensures that rewinds
are triggered only when actual modifications exist after divergence,
avoiding unnecessary rewinds in clean failover situations.

Also, with the proposed fix implemented in my local script, it gives the
following results:

   -

   Old primary shuts down cleanly.
   -

   Standby is promoted successfully.
   -

   pg_rewind correctly detects no rewind is needed.
   -

   Data on both clusters matches perfectly.

I believe this change will prevent unnecessary rewinds in production
environments, improve reliability, and avoid potential confusion during
failovers.

Thank you for your consideration.

Best regards,
Soumya.


On Sat, Sep 6, 2025 at 10:04 PM Srinath Reddy Sadipiralla <
[email protected]> wrote:

> Hi all,
>
> While working with pg_rewind, I noticed that it can sometimes request a
> rewind even when no actual changes exist after a failover.
>
> *Problem:*
> Currently, pg_rewind determines the end-of-WAL on the target by using the
> last shutdown checkpoint (or minRecoveryPoint for a standby). This creates
> a false positive scenario:
>
> 1)Suppose a standby is promoted to become the new primary.
> 2)Later, the old primary is cleanly shut down.
> 3)The only WAL record generated on the old primary after divergence is a
> shutdown checkpoint.
>
> At this point, the old primary and new primary contain identical data.
> However, since the shutdown checkpoint extends the WAL past the divergence
> point, pg_rewind concludes:
>
> if (target_wal_endrec > divergerec)
>     rewind_needed = true;
>
> That forces a rewind even though there are no meaningful changes.
>
> To *reproduce this scenario* use the below attached script.
>
> *Fix:*
> The attached patch changes the logic so that pg_rewind no longer treats
> shutdown checkpoints as meaningful records when determining the end-of-WAL.
> Instead, we scan backward from the last checkpoint until we find the most
> recent valid WAL record that is not a shutdown-only related record.
>
> This ensures rewind is only triggered when there are actual modifications
> after divergence, avoiding unnecessary rewinds in clean failover scenarios.
>
>
> --
> Thanks,
> Srinath Reddy Sadipiralla
> EDB: https://www.enterprisedb.com/
>

Attachment: run_pg_rewind_with_port.sh
Description: application/shellscript

From 44d0b45aa40d3fb0bc2b4907b277f321417dbc98 Mon Sep 17 00:00:00 2001
From: manimurali1993 <[email protected]>
Date: Tue, 23 Sep 2025 16:24:11 +0530
Subject: [PATCH] Modified the condition to ignore shutdown-only checkpoints to
 avoid pg_rewind false positives caused by shutdown

Signed-off-by: manimurali1993 <[email protected]>
---
 src/bin/pg_rewind/parsexlog.c | 36 ++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 8f4b282c6b1..af1fd474f6c 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -249,13 +249,35 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 			(info == XLOG_CHECKPOINT_SHUTDOWN ||
 			 info == XLOG_CHECKPOINT_ONLINE))
 		{
-			CheckPoint	checkPoint;
-
-			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-			*lastchkptrec = searchptr;
-			*lastchkpttli = checkPoint.ThisTimeLineID;
-			*lastchkptredo = checkPoint.redo;
-			break;
+			if (searchptr < forkptr &&
++           XLogRecGetRmid(xlogreader) == RM_XLOG_ID)
+			{
+				CheckPoint	checkPoint;
+
+				memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
+				*lastchkptrec = searchptr;
+				*lastchkpttli = checkPoint.ThisTimeLineID;
+				*lastchkptredo = checkPoint.redo;
+				break;
+				if (info == XLOG_CHECKPOINT_ONLINE)
+				{
+					CheckPoint   checkPoint;
+
+               		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
+               		*lastchkptrec = searchptr;
+               		*lastchkpttli = checkPoint.ThisTimeLineID;
+               		*lastchkptredo = checkPoint.redo;
+               		break;
+				}
+				else if (info == XLOG_CHECKPOINT_SHUTDOWN)
+				{
+					/*
+					 * Skip shutdown-only checkpoints, since they don't represent
+					 * real divergence. Keep scanning further back.
+					 */
+					continue;
+				}
+			}
 		}
 
 		/* Walk backwards to previous record. */
-- 
2.34.1

Reply via email to