Since this bug dates back to 9.3, the fix will likely need backpatching.
The v2 patch changes the walrcv_identify_system() signature, which would
be an ABI break on stable branches (walrcv_identify_system_fn is a
function pointer in the WalReceiverFunctionsType struct).

Attached is a backpatch-compatible variant that avoids the signature
change.  Instead of adding a parameter, libpqrcv_identify_system()
stores the flush position in a new global variable
(WalRcvIdentifySystemLsn), which the walreceiver then reads directly.
The fix logic and TAP test are otherwise identical.

For master I'd still prefer the v2 approach with the extended signature,
since it's cleaner and there's no ABI constraint.

Best regards,
Marco
From 8b3bb1e86177392d8e7772dabe5a61fcf5db069d Mon Sep 17 00:00:00 2001
From: Marco Nenciarini <[email protected]>
Date: Tue, 17 Mar 2026 10:27:05 +0100
Subject: [PATCH] Fix cascading standby reconnect failure after archive
 fallback

A cascading standby could fail to reconnect to its upstream standby
with "requested starting point ... is ahead of the WAL flush position"
after falling back to archive recovery.  This happened because the
walreceiver requests streaming from RecPtr, which can advance past the
upstream's flush position when WAL is restored from archive.

Fix by having the walreceiver check the upstream's current WAL flush
position via IDENTIFY_SYSTEM before issuing START_REPLICATION.
IDENTIFY_SYSTEM already returns this position (as xlogpos), but it
was previously discarded.  If the requested start point exceeds the
upstream's flush position on the same timeline, the walreceiver waits
for wal_retrieve_retry_interval and retries.

To preserve ABI compatibility on back branches, the flush position
from IDENTIFY_SYSTEM is communicated via a new global variable
(WalRcvIdentifySystemLsn) rather than changing the signature of
walrcv_identify_system().

The bug was introduced in PG 9.3 by commit abfd192b1b5, which added
a flush-position check in StartReplication() that rejects requests
ahead of the server's WAL flush position.

Signed-off-by: Marco Nenciarini <[email protected]>
---
 .../libpqwalreceiver/libpqwalreceiver.c       |  14 ++
 src/backend/replication/walreceiver.c         |  37 +++++
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/replication/walreceiver.h         |   7 +
 src/test/recovery/t/053_cascade_reconnect.pl  | 143 ++++++++++++++++++
 5 files changed, 202 insertions(+)
 create mode 100644 src/test/recovery/t/053_cascade_reconnect.pl

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9f04c9ed25d..a6024420394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -452,6 +452,20 @@ libpqrcv_identify_system(WalReceiverConn *conn, TimeLineID *primary_tli)
 						   PQntuples(res), PQnfields(res), 1, 3)));
 	primary_sysid = pstrdup(PQgetvalue(res, 0, 0));
 	*primary_tli = pg_strtoint32(PQgetvalue(res, 0, 1));
+
+	/* Column 2 is the server's current WAL flush position */
+	{
+		uint32		hi,
+					lo;
+
+		if (sscanf(PQgetvalue(res, 0, 2), "%X/%X", &hi, &lo) != 2)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("could not parse WAL location \"%s\"",
+							PQgetvalue(res, 0, 2))));
+		WalRcvIdentifySystemLsn = ((uint64) hi) << 32 | lo;
+	}
+
 	PQclear(res);
 
 	return primary_sysid;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..c211844ce73 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -54,6 +54,7 @@
 #include "access/htup_details.h"
 #include "access/timeline.h"
 #include "access/transam.h"
+#include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
@@ -95,6 +96,12 @@ bool		hot_standby_feedback;
 static WalReceiverConn *wrconn = NULL;
 WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 
+/*
+ * Server's WAL flush position from the last IDENTIFY_SYSTEM call.
+ * Written by libpqwalreceiver, read by walreceiver main loop.
+ */
+XLogRecPtr	WalRcvIdentifySystemLsn = InvalidXLogRecPtr;
+
 /*
  * These variables are used similarly to openLogFile/SegNo,
  * but for walreceiver to write the XLOG. recvFileTLI is the TimeLineID
@@ -338,6 +345,36 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					 errmsg("highest timeline %u of the primary is behind recovery timeline %u",
 							primaryTLI, startpointTLI)));
 
+		/*
+		 * If our requested startpoint is ahead of the primary server's
+		 * current WAL flush position, we cannot start streaming yet.  This
+		 * can happen when a cascading standby has advanced past the upstream
+		 * via archive recovery.  In this case, wait for the upstream to
+		 * catch up before attempting START_REPLICATION, which would
+		 * otherwise fail with "requested starting point is ahead of the WAL
+		 * flush position".
+		 *
+		 * We only perform this check when we're on the same timeline as the
+		 * primary; when timelines differ, let START_REPLICATION handle the
+		 * timeline negotiation.
+		 */
+		if (startpointTLI == primaryTLI &&
+			startpoint > WalRcvIdentifySystemLsn)
+		{
+			ereport(LOG,
+					errmsg("walreceiver requested start point %X/%08X on timeline %u is ahead of the primary server's flush position %X/%08X, waiting",
+						   LSN_FORMAT_ARGS(startpoint), startpointTLI,
+						   LSN_FORMAT_ARGS(WalRcvIdentifySystemLsn)));
+
+			(void) WaitLatch(MyLatch,
+							 WL_EXIT_ON_PM_DEATH | WL_TIMEOUT | WL_LATCH_SET,
+							 wal_retrieve_retry_interval,
+							 WAIT_EVENT_WAL_RECEIVER_UPSTREAM_CATCHUP);
+			ResetLatch(MyLatch);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
 		/*
 		 * Get any missing history files. We do this always, even when we're
 		 * not interested in that timeline, so that if we're promoted to
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4aa864fe3c3..5d7f97015db 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -161,6 +161,7 @@ SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
+WAL_RECEIVER_UPSTREAM_CATCHUP	"Waiting for upstream server WAL flush position to catch up to requested start point."
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to be generated."
 XACT_GROUP_UPDATE	"Waiting for the group leader to update transaction status at transaction end."
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9b9bd916314..daee2944973 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -165,6 +165,13 @@ typedef struct
 
 extern PGDLLIMPORT WalRcvData *WalRcv;
 
+/*
+ * Server's WAL flush position as reported by the last IDENTIFY_SYSTEM call.
+ * Set by walrcv_identify_system(), used by the walreceiver to avoid
+ * requesting streaming from a point ahead of the upstream's flush position.
+ */
+extern PGDLLIMPORT XLogRecPtr WalRcvIdentifySystemLsn;
+
 typedef struct
 {
 	bool		logical;		/* True if this is logical replication stream,
diff --git a/src/test/recovery/t/053_cascade_reconnect.pl b/src/test/recovery/t/053_cascade_reconnect.pl
new file mode 100644
index 00000000000..54db26b709c
--- /dev/null
+++ b/src/test/recovery/t/053_cascade_reconnect.pl
@@ -0,0 +1,143 @@
+
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Test that a cascading standby can reconnect to its upstream standby after
+# advancing past the upstream's WAL flush position via archive recovery.
+#
+# Setup: primary -> standby_a -> standby_b
+# standby_b has both streaming (from standby_a) and restore_command
+# (from primary's archive).
+#
+# When standby_a's walreceiver is stopped and standby_b falls back to
+# archive recovery, standby_b may advance its recovery position past
+# standby_a's replay position.  Previously, standby_b's walreceiver
+# would fail with "requested starting point is ahead of the WAL flush
+# position" when reconnecting to standby_a.
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary with archiving
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+$node_primary->append_conf(
+	'postgresql.conf', qq(
+wal_keep_size = 128MB
+));
+$node_primary->start;
+
+# Take backup and create standby_a (streaming from primary, no archive)
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby_a = PostgreSQL::Test::Cluster->new('standby_a');
+$node_standby_a->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby_a->start;
+
+# Wait for standby_a to start streaming
+$node_primary->wait_for_catchup($node_standby_a);
+
+# Take backup from standby_a and create standby_b
+# standby_b streams from standby_a AND restores from primary's archive
+$node_standby_a->backup($backup_name);
+
+my $node_standby_b = PostgreSQL::Test::Cluster->new('standby_b');
+$node_standby_b->init_from_backup($node_standby_a, $backup_name,
+	has_streaming => 1);
+$node_standby_b->enable_restoring($node_primary);
+$node_standby_b->start;
+
+# Generate initial data and wait for full cascade replication
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE test_tab AS SELECT generate_series(1, 1000) AS id");
+$node_primary->wait_for_replay_catchup($node_standby_a);
+$node_standby_a->wait_for_replay_catchup($node_standby_b, $node_primary);
+
+my $result = $node_standby_b->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab");
+is($result, '1000', 'initial data replicated to cascading standby');
+
+# Disconnect standby_a from primary by clearing primary_conninfo.
+# This stops standby_a's walreceiver, so standby_a can no longer receive
+# new WAL.  Its GetStandbyFlushRecPtr() will return only replayPtr.
+$node_standby_a->append_conf('postgresql.conf', "primary_conninfo = ''");
+$node_standby_a->reload;
+
+# Wait for standby_a's walreceiver to stop
+$node_standby_a->poll_query_until('postgres',
+	"SELECT NOT EXISTS (SELECT 1 FROM pg_stat_wal_receiver)")
+  or die "Timed out waiting for standby_a walreceiver to stop";
+
+# Stop standby_b cleanly.  We'll restart it after generating new WAL
+# so it enters the recovery state machine fresh and tries archive first.
+$node_standby_b->stop;
+
+# Generate more WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO test_tab SELECT generate_series(1001, 2000)");
+
+# Force WAL switch and wait for archiving to complete, so that
+# standby_b can find the new WAL in the archive when it starts.
+my $walfile = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn())");
+$node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
+$node_primary->poll_query_until('postgres',
+	"SELECT '$walfile' <= last_archived_wal FROM pg_stat_archiver")
+  or die "Timed out waiting for WAL archiving";
+
+# Rotate standby_b's log so we can check just the new log output
+$node_standby_b->rotate_logfile;
+
+# Start standby_b.  It will:
+# 1. Read new WAL from primary's archive (XLOG_FROM_ARCHIVE)
+# 2. Advance RecPtr past standby_a's replay position
+# 3. Try streaming from standby_a (XLOG_FROM_STREAM)
+# 4. With the fix: walreceiver detects upstream is behind via
+#    IDENTIFY_SYSTEM and waits instead of failing
+$node_standby_b->start;
+
+# Wait for standby_b to replay the new data from archive
+$node_standby_b->poll_query_until('postgres',
+	"SELECT count(*) >= 2000 FROM test_tab")
+  or die "Timed out waiting for standby_b to replay archived WAL";
+
+$result = $node_standby_b->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab");
+is($result, '2000',
+	'cascading standby replayed new data from archive');
+
+# Verify no "requested starting point is ahead" errors occurred.
+# Before the fix, standby_b's walreceiver would fail with this error
+# when trying to reconnect to standby_a.
+ok( !$node_standby_b->log_contains(
+		"requested starting point .* is ahead of the WAL flush position"),
+	'no "ahead of flush position" errors in standby_b log');
+
+# Now restore standby_a's streaming from primary so it can catch up
+$node_standby_a->enable_streaming($node_primary);
+$node_standby_a->reload;
+
+# Wait for standby_a to catch up with primary
+$node_primary->wait_for_replay_catchup($node_standby_a);
+
+# standby_b's walreceiver should eventually connect to standby_a and
+# resume streaming (once standby_a has caught up past standby_b's position)
+$node_standby_a->poll_query_until('postgres',
+	"SELECT EXISTS (SELECT 1 FROM pg_stat_replication)")
+  or die "Timed out waiting for standby_b to reconnect to standby_a";
+
+# Verify end-to-end cascade streaming works with new data
+$node_primary->safe_psql('postgres',
+	"INSERT INTO test_tab SELECT generate_series(2001, 3000)");
+$node_primary->wait_for_replay_catchup($node_standby_a);
+$node_standby_a->wait_for_replay_catchup($node_standby_b, $node_primary);
+
+$result = $node_standby_b->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab");
+is($result, '3000',
+	'cascade streaming resumes normally after upstream catches up');
+
+done_testing();
-- 
2.47.3
