Re: Implement waiting for wal lsn replay: reloaded

Xuneng Zhou Mon, 06 Apr 2026 00:15:49 -0700

On Mon, Apr 6, 2026 at 2:01 PM Xuneng Zhou <[email protected]> wrote:
>
> On Mon, Apr 6, 2026 at 11:26 AM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi Alexander,
> >
> > On Sun, Apr 5, 2026 at 8:31 PM Alexander Korotkov <[email protected]> 
> > wrote:
> > >
> > > On Thu, Jan 29, 2026 at 9:47 AM Alexander Korotkov <[email protected]> 
> > > wrote:
> > > > On Tue, Jan 27, 2026 at 3:14 AM Xuneng Zhou <[email protected]> 
> > > > wrote:
> > > > > Heikki spotted a misplaced wake-up call for replay waiters in
> > > > > PerformWalRecovery. He suggested that the WaitLSNWakeup needs to be
> > > > > invoked immediately after wal record is applied to avoid the potential
> > > > > missed wake-ups when recovery stops/pauses/promotes. It makes sense to
> > > > > me. Please check the attached patch to fix that.
> > > >
> > > > Pushed, thank you!
> > >
> > > I've assembled small patches, which I think worth pushing before v19 FF.
> > >
> > > 1. Avoid syscache lookup in WaitStmtResultDesc().  This is [1] patch,
> > > but I've removed redundant comment in ExecWaitStmt(), which explained
> > > the same as WaitStmtResultDesc() comment.
> > > 2. Use WAIT FOR LSN in wait_for_catchup().  I made the following
> > > changes: fallback to polling on not_in_recovery result instead of
> > > croaking, avoid separate pg_is_in_recovery() query, comment why we
> > > may face the recovery conflict and why it is safe to detect this by
> > > the error string.
> > > 3. A paragraph to the docs about possible recovery conflicts and their 
> > > reason.
> > >
> > > I'm going to push this on Monday if no objections.
> > >
> > > Links.
> > > 1. 
> > > https://www.postgresql.org/message-id/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com
> > > 2. 
> > > https://www.postgresql.org/message-id/CABPTF7X0n%3DR50z2fBpj3EbYYz04Ab0-DHJa%2BJfoAEny62QmUdg%40mail.gmail.com
> >
> > I'll review them shortly.
> >
>
> Here're some comments:
>
> 1) Patch 2 checks for not_in_recovery, but WAIT FOR ... NO_THROW
> returns “not in recovery”, which appears to prevent the
> promoted-standby fallback described in the patch from triggering.
> Instead, it seems to result in a hard failure.
>
> There are some behavior changes that I overlooked earlier:
>
> 2)  Patch 2 changes the helper from "wait until timeout if there is no
> active replication connection" to "connect to the standby immediately
> and fail on ordinary connection failures". This appears to differ from
> the current contract and may affect cases where the standby is down,
> restarting, or not yet accepting SQL.
>
> "If there is no active replication connection from this peer, waits
> until poll_query_until timeout."
>
> 3) In patch 2, we returns success as soon as the standby locally
> reaches the target LSN, but the existing helper is explicitly defined
> in terms of pg_stat_replication ... state = 'streaming' on the
> upstream. So the new path can report success even when there is no
> active replication connection from that peer anymore.
>
> "The replication connection must be in a streaming state."
>
> I’m not sure whether these semantic changes are intended.
>


After some thoughts, these semantic changes appear to be improvements
over the artifacts of polling-based behavior. The pg_stat_replication
... state = 'streaming' is a precondition of polling works rather than
a post-condition check. The retry timeout also has nothing to do with
the edge cases and the state of the standby.

In practice, wait_for_catchup() is called when the standby is expected
to be up. If it's not, one of two things is true:

The standby is briefly restarting -> start() already waits for
readiness before returning, so this is a non-issue.
The standby is genuinely down -> waiting 180 seconds to time out
wastes CI time compared to failing fast.

I’ve addressed point 1 and updated the comments accordingly to reflect
the new behavior. Please check it.


--
Best,
Xuneng

From a44b45214182807f629f16f13a0ad4e55aaab86a Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Mon, 6 Apr 2026 14:25:36 +0800
Subject: [PATCH v4 1/3]  Avoid syscache lookup while building a WAIT FOR tuple
  descriptor

Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building
the result tuple descriptor for the WAIT FOR command. This avoids a syscache
access that could re-establish a catalog snapshot after we've explicitly
released all snapshots before the wait.

Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com
Author: Xuneng Zhou <[email protected]>
Reviewed-by: Alexander Korotkov <[email protected]>
---
 src/backend/commands/wait.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index e64e3be6285..85fcd463b4c 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -335,10 +335,17 @@ WaitStmtResultDesc(WaitStmt *stmt)
 {
 	TupleDesc	tupdesc;
 
-	/* Need a tuple descriptor representing a single TEXT  column */
+	/*
+	 * Need a tuple descriptor representing a single TEXT column.
+	 *
+	 * We use TupleDescInitBuiltinEntry instead of TupleDescInitEntry to avoid
+	 * syscache access. This is important because WaitStmtResultDesc may be
+	 * called after snapshots have been released, and we must not re-establish
+	 * a catalog snapshot which could cause recovery conflicts on a standby.
+	 */
 	tupdesc = CreateTemplateTupleDesc(1);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
-					   TEXTOID, -1, 0);
+	TupleDescInitBuiltinEntry(tupdesc, (AttrNumber) 1, "status",
+							  TEXTOID, -1, 0);
 	TupleDescFinalize(tupdesc);
 	return tupdesc;
 }
-- 
2.51.0

From b6acf00fa2c8cf8f29dbd87b674fc88f99ac916f Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Mon, 6 Apr 2026 14:52:58 +0800
Subject: [PATCH v4 3/3] Document that WAIT FOR may be interrupted by recovery
 conflicts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a note to the WAIT FOR documentation explaining that sessions
using this command on a standby server may be interrupted by recovery
conflicts.  Some conflicts are unavoidable — for example, replaying
a tablespace drop terminates all backends unconditionally.
---
 doc/src/sgml/ref/wait_for.sgml | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index df72b3327c8..c30fba6f05a 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -253,6 +253,16 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
    timeline.
   </para>
 
+  <para>
+   On a standby server, <command>WAIT FOR</command> sessions may be
+   interrupted by recovery conflicts.  Some recovery conflicts are
+   unavoidable: for example, replaying a tablespace drop resolves
+   conflicts by terminating all backends, regardless of what they are
+   doing.  Applications using <command>WAIT FOR</command> on a standby
+   should be prepared to handle such interruptions, for example by
+   retrying the command or falling back to an alternative mechanism.
+  </para>
+
 </refsect1>
 
  <refsect1>
-- 
2.51.0

From 8a5001f7a443c070fb9247c4d0e6a194b1ee5dc1 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Mon, 6 Apr 2026 14:51:08 +0800
Subject: [PATCH v4 2/3] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

Rather than pre-checking pg_is_in_recovery() on the standby (which
would add an extra round-trip on every call), we issue WAIT FOR LSN
directly and handle the 'not in recovery' result as a signal to fall
back to polling.

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), when the standby has been
promoted, or when WAIT FOR LSN is interrupted by a recovery conflict,
the function falls back to the original polling-based approach using
pg_stat_replication on the upstream.  The recovery conflict fallback
is necessary because some conflicts are unavoidable — for example,
ResolveRecoveryConflictWithTablespace() kills all backends
unconditionally, regardless of what they are doing.

The recovery conflict detection matches the English error message
"conflict with recovery", which is reliable because the test suite
runs with LC_MESSAGES=C.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <[email protected]>
Reviewed-by: Alexander Korotkov <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Alvaro Herrera <[email protected]>
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 103 +++++++++++++++++++++--
 1 file changed, 95 insertions(+), 8 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b44aefb545a..8058c95b429 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3333,11 +3333,8 @@ sub wait_for_event
 
 =item $node->wait_for_catchup(standby_name, mode, target_lsn)
 
-Wait for the replication connection with application_name standby_name until
-its 'mode' replication column in pg_stat_replication equals or passes the
-specified or default target_lsn.  By default the replay_lsn is waited for,
-but 'mode' may be specified to wait for any of sent|write|flush|replay.
-The replication connection must be in a streaming state.
+Wait until the standby identified by standby_name has reached the specified
+or default target_lsn for the given mode (sent|write|flush|replay).
 
 When doing physical replication, the standby is usually identified by
 passing its PostgreSQL::Test::Cluster instance.  When doing logical
@@ -3355,8 +3352,16 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
-If there is no active replication connection from this peer, waits until
-poll_query_until timeout.
+When the standby is passed as a PostgreSQL::Test::Cluster instance and the
+mode is replay, write, or flush, the function uses WAIT FOR LSN on the
+standby for latch-based wakeup instead of polling.  If the standby has been
+promoted, if the session is interrupted by a recovery conflict, or if the
+standby is unreachable, it falls back to polling.
+
+For 'sent' mode, when the standby is passed as a string (e.g., a
+subscription name), or as a fallback from the above, the function polls
+pg_stat_replication on the upstream.  The replication connection must be
+in a streaming state for this path.
 
 Requires that the 'postgres' db exists and is accessible.
 
@@ -3374,10 +3379,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3402,6 +3410,85 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	#
+	# We skip the pg_is_in_recovery() pre-check and just attempt WAIT FOR
+	# LSN directly.  If the standby was promoted, it returns 'not_in_recovery'
+	# and we fall back to polling.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		# Map mode names to WAIT FOR LSN mode names
+		my %mode_map = (
+			'replay' => 'standby_replay',
+			'write' => 'standby_write',
+			'flush' => 'standby_flush',);
+		my $wait_mode = $mode_map{$mode};
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		my $wait_query =
+		  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+
+		# Try WAIT FOR LSN. If it succeeds, we're done. If it returns
+		# 'not_in_recovery' (standby was promoted), fall back to polling.
+		# If the session is interrupted (e.g., killed by recovery conflict),
+		# fall back to polling on the upstream which is immune to standby-
+		# side conflicts.
+		my $output;
+		local $@;
+		my $wait_succeeded = eval {
+			$output = $standby_node->safe_psql('postgres', $wait_query);
+			chomp($output);
+			1;
+		};
+
+		if ($wait_succeeded && $output eq 'success')
+		{
+			print "done\n";
+			return;
+		}
+
+		# 'not in recovery' means the standby was promoted.
+		if ($wait_succeeded && $output eq 'not in recovery')
+		{
+			diag
+			  "WAIT FOR LSN returned 'not in recovery', falling back to polling";
+		}
+		# 'timeout' is a hard failure — no point falling back to polling.
+		elsif ($wait_succeeded)
+		{
+			my $details = $self->safe_psql('postgres',
+				"SELECT * FROM pg_catalog.pg_stat_replication");
+			diag qq(WAIT FOR LSN returned '$output'
+pg_stat_replication on upstream:
+${details});
+			croak
+			  "WAIT FOR LSN '$wait_mode' to '$target_lsn' returned '$output'";
+		}
+		# WAIT FOR LSN was interrupted.  Fall back to polling if this
+		# looks like a recovery conflict.  We match the English error
+		# message "conflict with recovery" which is reliable because the
+		# test suite runs with LC_MESSAGES=C.  Other errors should fail
+		# immediately rather than being masked by a silent fallback.
+		elsif ($@ =~ /conflict with recovery/i)
+		{
+			diag qq(WAIT FOR LSN interrupted, falling back to polling:
+$@);
+		}
+		else
+		{
+			croak "WAIT FOR LSN failed: $@";
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
+	# - When WAIT FOR LSN was interrupted (e.g., killed by a recovery conflict)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

Re: Implement waiting for wal lsn replay: reloaded

Reply via email to