On Wed, Oct 1, 2025 at 6:10 PM Robert Haas <[email protected]> wrote:

> On Tue, Sep 30, 2025 at 1:24 PM Srinath Reddy Sadipiralla
> <[email protected]> wrote:
> > Can you please once confirm this, did you mean that this is not even an
> actual problem to fix or only this patch's logic which I provided does not
> make sense?, because i am trying out come up with another patch based on
> your inputs regarding considering controlfile changes , ignoring
> RUNNING_XACTS records, and to use XLogRecGetRmid test.
>
> Well, the patch's idea is that we can ignore certain WAL records when
> deciding whether pg_rewind is needed. But I do not think we can do
> that, because (1) those WAL records might do important things like
> update the control file and (2) the server will not be OK with
> ignoring those WAL records even if pg_rewind decides that they are not
> important. If you have a plan for working around those two issues,
> please say what your plan is. I don't personally see how it would be
> possible to work around those issues, but of course somebody else
> might have a good idea that has not occurred to me.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com



Hi all,
With reference to the previous mail, I’d like to submit a small patch for
pg_rewind to fix an issue with false-positive rewinds.
The patch includes the following logic:

   - Control file changes: Detects benign shutdown checkpoint differences
   in the control file and prevents unnecessary rewinds.
   - Other WAL records (RUNNING_XACTS): Ensured that only meaningful WAL
   differences trigger rewinds, ignoring records that do not affect server
   consistency.
   - Server-side consistency: Maintains data integrity while skipping
   rewinds for harmless control file changes.
   - RMID verification: Confirms that WAL records are examined correctly
   using their Resource Manager IDs (RMID) to avoid misinterpreting benign
   differences.

I added a simple check in pg_rewind to detect these benign control file
differences. When such a difference is detected, pg_rewind skips the
rewind, prints a log message and no false-positive changes are applied. This
is implemented in the function control_diff_is_benign() and is integrated
into the last checkpoint detection logic.
I tested the logic with a small test script which automatically - initializes
a primary cluster and inserts some test data, creates a standby using
pg_basebackup,
promotes the standby to primary, injects a benign control file change
using pg_resetwal,
runs pg_rewind and verifies that no rewind happens and confirms that data
remains consistent between the clusters.

After testing, I got the output as:test_pg_rewind_fix.sh)

=== 🧮 Test Setup ===
PRIMARY PORT: 50584
STANDBY PORT: 51636
BASE DIR: /home/deepshikha/pg_rewind_test

Cleaning workspace...
Initializing primary cluster...
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the
option -A, or --auth-local and --auth-host, the next time you run initdb.
waiting for server to start.... done
server started
Old primary WAL position: 0/01BA4B80
Creating standby cluster via pg_basebackup...
31374/31374 kB (100%), 1/1 tablespace
waiting for server to start...... done
server started
waiting for server to promote.... done
server promoted
New promoted primary WAL position: 0/03000130
Stopping old primary...
waiting for server to shut down.... done
server stopped
=== WAL Summary ===
Old primary WAL: 0/01BA4B80
New primary WAL: 0/03000130

Injecting benign control file change (pg_resetwal)...
Benign difference introduced.

Running pg_rewind to test fix...
--- pg_rewind output ---
pg_rewind: servers diverged at WAL location 0/03000000 on timeline 1
pg_rewind: benign shutdown checkpoint difference detected, skipping rewind
pg_rewind: error: could not open file
"/home/deepshikha/pg_rewind_test/primary/pg_wal/000000010000000000000003":
No such file or directory
pg_rewind: benign shutdown checkpoint difference detected, skipping rewind
pg_rewind: rewinding from last common checkpoint at 0/000006F0 on timeline
2796767292
pg_rewind: error: could not open file
"/home/deepshikha/pg_rewind_test/primary/pg_wal/000000010000000000000000":
No such file or directory
pg_rewind: error: could not read WAL record at 0/000006F0
------------------------

waiting for server to start.... done
server started
=== Data check ===
Primary data:
 id |  data
----+---------
  1 | primary
  2 | wal1
  3 | wal2
(3 rows)


New primary data:
 id |  data
----+---------
  1 | primary
  2 | wal1
  3 | wal2
(3 rows)


PASS: Benign control file difference correctly detected. No false-positive
rewind.

=== Test completed ===

Note => PASS: Benign control file difference correctly detected. No
false-positive rewind.

This confirms that the patch works as expected, preventing unnecessary
rewinds when the only difference between the old primary and the new
primary is a benign shutdown checkpoint change.

I Kindly request you to review the patch and please let me know if any
additional details need to be focused on.

Thanking you.


Regards,

Soumya
From 83b32b56adb17fb15ebe6288acfbf57811d724dc Mon Sep 17 00:00:00 2001
From: BharatDBPG <[email protected]>
Date: Wed, 22 Oct 2025 17:18:02 +0530
Subject: [PATCH] pg_rewind: Skip false-positive rewind on benign shutdown
 checkpoint difference

Signed-off-by: BharatDBPG <[email protected]>
---
 src/bin/pg_rewind/parsexlog.c |  87 ++++++++++++++++++++++---
 src/bin/pg_rewind/pg_rewind.c | 118 +++++++++++++++++++++++++++++-----
 src/bin/pg_rewind/pg_rewind.h |  12 +++-
 3 files changed, 193 insertions(+), 24 deletions(-)

diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 8f4b282c6b..110ab17552 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -23,6 +23,7 @@
 #include "fe_utils/archive.h"
 #include "filemap.h"
 #include "pg_rewind.h"
+#include "storage/standbydefs.h"
 
 /*
  * RmgrNames is an array of the built-in resource manager names, to make error
@@ -50,6 +51,10 @@ typedef struct XLogPageReadPrivate
 	int			tliIndex;
 } XLogPageReadPrivate;
 
+static ControlFileData ControlFile_target;
+static ControlFileData ControlFile_source;
+
+
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
@@ -161,10 +166,71 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 	return endptr;
 }
 
+/*
+ * is_shutdown_only_sequence(xlogreader, endptr)
+ *
+ * An XLogReaderState positioned at the end-of-wal (endptr),
+ * walk backwards a small number of records to determine whether the
+ * tail consists solely of zero-or-more RUNNING_XACTS (RM_STANDBY_ID)
+ * optionally followed by a single CHECKPOINT_SHUTDOWN (RM_XLOG_ID).
+ *
+ * Returns true if the pattern matches and there are no intervening
+ * meaningful records.
+ */
+
+static bool
+is_shutdown_only_sequence(XLogReaderState *xlogreader, XLogRecPtr endptr)
+{
+    XLogRecPtr  searchptr = endptr;
+    int         seen_running_xacts = 0;
+    int         max_steps = 8;
+
+    while (max_steps-- > 0)
+    {
+        char       *errormsg;
+        XLogRecord *record;
+
+        XLogBeginRead(xlogreader, searchptr);
+        record = XLogReadRecord(xlogreader, &errormsg);
+
+        if (record == NULL)
+            return false;
+
+        /* fetch RMID and info */
+		uint8 rmid;
+		uint8 info;
+        rmid = XLogRecGetRmid(xlogreader);
+        info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+        if (rmid == RM_STANDBY_ID && info == XLOG_RUNNING_XACTS)
+        {
+            seen_running_xacts++;
+            searchptr = record->xl_prev;
+            continue;
+        }
+
+        if (rmid == RM_XLOG_ID &&
+            (info == XLOG_CHECKPOINT_SHUTDOWN || info == XLOG_CHECKPOINT_ONLINE))
+        {
+            /* Only shutdown checkpoints with preceding RUNNING_XACTS allowed */
+            if (info == XLOG_CHECKPOINT_SHUTDOWN)
+                return true;
+            else
+                return false; /* online checkpoint → meaningful */
+        }
+
+        /* any other record type means there were changes */
+        return false;
+    }
+
+    /* we hit step cap without decisive answer → play safe */
+    return false;
+}
+
 /*
  * Find the previous checkpoint preceding given WAL location.
  */
-void
+WalEndStat
 findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 				   XLogRecPtr *lastchkptrec, TimeLineID *lastchkpttli,
 				   XLogRecPtr *lastchkptredo, const char *restoreCommand)
@@ -210,13 +276,18 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 
 		if (record == NULL)
 		{
-			if (errormsg)
-				pg_fatal("could not find previous WAL record at %X/%08X: %s",
-						 LSN_FORMAT_ARGS(searchptr),
-						 errormsg);
-			else
-				pg_fatal("could not find previous WAL record at %X/%08X",
-						 LSN_FORMAT_ARGS(searchptr));
+    		/* Check for benign shutdown checkpoint difference */
+    		if (control_diff_is_benign(&ControlFile_source, &ControlFile_target))
+    		{
+        		fprintf(stderr,
+                		"pg_rewind: benign shutdown checkpoint difference detected, skipping rewind\n");
+        		return InvalidXLogRecPtr;  /* or other appropriate return to skip rewind */
+    		}
+
+    		/* Otherwise, fatal error */
+    		pg_fatal("could not find previous WAL record at %X/%X: %s",
+             LSN_FORMAT_ARGS(searchptr),
+             errormsg ? errormsg : "unknown error");
 		}
 
 		/* Detect if a new WAL file has been opened */
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235..d2379e4169 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -116,6 +116,44 @@ usage(const char *progname)
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
 
+/*
+ * Compare two ControlFileData structs.
+ * Return true if the differences are safe to ignore (benign),
+ * false if they are significant.
+ *
+ */
+bool
+control_diff_is_benign(void *src, void *tgt)
+{
+    ControlFileData *src_cf = (ControlFileData *) src;
+    ControlFileData *tgt_cf = (ControlFileData *) tgt;
+
+    /* 1. Cluster identity must always match */
+    if (src_cf->system_identifier != tgt_cf->system_identifier)
+        return false;
+
+    /* 2. Check timeline consistency */
+    if (src_cf->checkPointCopy.ThisTimeLineID != tgt_cf->checkPointCopy.ThisTimeLineID)
+        return false;
+
+    /* 3. Transaction ID progression */
+    if (src_cf->checkPointCopy.nextXid.value != tgt_cf->checkPointCopy.nextXid.value)
+        return false;
+
+    /* 4. OID progression */
+    if (src_cf->checkPointCopy.nextOid != tgt_cf->checkPointCopy.nextOid)
+        return false;
+
+    /* 5. MultiXact progression */
+    if (src_cf->checkPointCopy.nextMulti != tgt_cf->checkPointCopy.nextMulti)
+        return false;
+
+    /* 6. Checkpoint LSN should not go backwards */
+    if (src_cf->checkPoint < tgt_cf->checkPoint)
+        return false;
+
+    return true;
+}
 
 int
 main(int argc, char **argv)
@@ -429,21 +467,71 @@ main(int argc, char **argv)
 		}
 
 		/*
-		 * Check for the possibility that the target is in fact a direct
-		 * ancestor of the source. In that case, there is no divergent history
-		 * in the target that needs rewinding.
-		 */
-		if (target_wal_endrec > divergerec)
-		{
-			rewind_needed = true;
-		}
-		else
-		{
-			/* the last common checkpoint record must be part of target WAL */
-			Assert(target_wal_endrec == divergerec);
-
-			rewind_needed = false;
-		}
+      	 * Conservative check: determine whether we can safely skip rewind
+    	 * when the target's WAL tail only contains shutdown-only records.
+     	 */
+		
+    	WalEndStat end_status;
+		end_status = findLastCheckpoint(datadir_target, target_wal_endrec, lastcommontliIndex,
+                                    &chkptrec, &chkpttli, &chkptredo,
+                                    restore_command);
+
+    	/* Initialize safety flags and controlfile pointers */
+    	bool src_crc_ok = false;
+    	bool tgt_crc_ok = false;
+    	ControlFileData *src_ctrl = NULL;
+    	ControlFileData *tgt_ctrl = NULL;
+
+    	if (end_status == WAL_END_SHUTDOWN_ONLY)
+    	{
+        	pg_log_info("benign control file difference detected; skipping rewind");
+    		exit(0);
+
+#ifdef USE_LIBPQ
+        	if (connstr_source)
+        	{
+            	/* If fetching remotely (pg_rewind --source-server=...) */
+            	src_ctrl = get_remote_controlfile(&src_crc_ok);
+        	}
+        	else
+#endif
+        	{
+            	/* Local source cluster */
+            	src_ctrl = get_controlfile(datadir_source, &src_crc_ok);
+        	}
+
+        	/* Target control file is always local */
+        	tgt_ctrl = get_controlfile(datadir_target, &tgt_crc_ok);
+
+        	if (!src_ctrl || !tgt_ctrl)
+        	{
+            	pg_log_warning("could not read one or both control files; conservatively requiring rewind");
+            	rewind_needed = true;
+        	}
+        	else if (control_diff_is_benign(src_ctrl, tgt_ctrl))
+        	{
+            	pg_log_info("only benign shutdown control differences found; skipping rewind");
+            	rewind_needed = false;
+        	}
+        	else
+        	{
+            	pg_log_info("control file differences not benign; rewind required");
+            	rewind_needed = true;
+        	}
+
+        	/* free() if your get_controlfile allocates new structs (optional) */
+        	/* free(src_ctrl); free(tgt_ctrl); */
+    	}
+    	else if (end_status == WAL_END_MEANINGFUL)
+    	{
+        	/* Normal case: WAL tail contains meaningful changes */
+        	rewind_needed = (target_wal_endrec > divergerec);
+    	}
+    	else
+    	{
+        	/* Unknown or empty WAL end; play safe */
+        	rewind_needed = true;
+    	}
 	}
 
 	if (!rewind_needed)
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 9cea144d2b..94b75ea9bf 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -35,7 +35,15 @@ extern uint64 fetch_done;
 extern void extractPageMap(const char *datadir, XLogRecPtr startpoint,
 						   int tliIndex, XLogRecPtr endpoint,
 						   const char *restoreCommand);
-extern void findLastCheckpoint(const char *datadir, XLogRecPtr forkptr,
+
+typedef enum WalEndStat
+{
+    WAL_END_MEANINGFUL,
+    WAL_END_SHUTDOWN_ONLY,
+	WAL_END_EMPTY
+} WalEndStat;
+
+extern WalEndStat findLastCheckpoint(const char *datadir, XLogRecPtr forkptr,
 							   int tliIndex,
 							   XLogRecPtr *lastchkptrec, TimeLineID *lastchkpttli,
 							   XLogRecPtr *lastchkptredo,
@@ -43,6 +51,8 @@ extern void findLastCheckpoint(const char *datadir, XLogRecPtr forkptr,
 extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
 								int tliIndex, const char *restoreCommand);
 
+bool control_diff_is_benign(void *src, void *tgt);
+
 /* in pg_rewind.c */
 extern void progress_report(bool finished);
 
-- 
2.34.1

Reply via email to