On Wed, Apr 8, 2026 at 7:23 AM Alexander Korotkov <[email protected]> wrote:
>
> Hi, Xuneng!
>
> > Here is some analysis of the issue reported by Tom:
> >
> > 1) The problem
> >
> > WAIT FOR LSN with standby_write or standby_flush mode can block
> > indefinitely on an idle primary even when the target LSN is already
> > satisfied by WAL on disk.
> >
> > The walreceiver initializes its process-local LogstreamResult.Write
> > and LogstreamResult.Flush from GetXLogReplayRecPtr() at connect time,
> > reflecting all WAL already present on the standby (from a base backup,
> > archive restore, or prior streaming). The shared-memory positions used
> > by WAIT FOR LSN, however, are not seeded from this value:
> >
> > WalRcv->writtenUpto is zero-initialized by ShmemInitStruct and remains
> > 0 until XLogWalRcvWrite() processes incoming streaming data.
> > WalRcv->flushedUpto is initialized to the segment-aligned streaming
> > start point by RequestXLogStreaming(), which may be significantly
> > behind the replay position. It advances only when XLogWalRcvFlush()
> > processes new data — which itself requires LogstreamResult.Flush <
> > LogstreamResult.Write, a condition that never holds at startup since
> > both fields are initialized to the same value.
> >
> > When the primary is idle and sends no new WAL, both positions stay at
> > their initial stale values indefinitely.
> >
> > 2) The fix
> > Seed writtenUpto and flushedUpto from LogstreamResult immediately
> > after the walreceiver initializes those process-local fields, then
> > call WaitLSNWakeup() to wake any already-blocked waiters.
> >
> > This broadens the semantics of these fields. writtenUpto and
> > flushedUpto used to track only WAL written or flushed by the current
> > walreceiver session — WAL received from the primary since the most
> > recent connect. After this change, they are initialized to the replay
> > position, so they also cover WAL that was already on disk before
> > streaming began. This affects pg_stat_wal_receiver.written_lsn and
> > flushed_lsn, which will now report the replay position immediately at
> > walreceiver startup rather than 0 and the segment boundary
> > respectively. I am still considering whether this semantic change is
> > acceptable though it does shorten the runtime of the tap tests
> > reported by Tom in my test. Another approach is to modify the logic of
> > GetCurrentLSNForWaitType to cope with this special case and leave the
> > publisher side alone without changing the semantics. But this seems to
> > be more subtle.
>
> Patch 0001 looks OK to me.
> Regarding patch 0002: the changes made to GetCurrentLSNForWaitType()
> look reliable to me.  PerformWalRecovery() sets replayed positions
> before starting recovery, and in turn before standby can accept
> connections.  So, changes to WalReceiverMain() don't look necessary to
> me.

Yeah, GetCurrentLSNForWaitType() seems to be the right place for the
fix. Please see the attached patch 0002.
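
For readers without the attachment at hand, here is a rough standalone
model of the "replay position as floor" idea. It is only a sketch, not
the patch code: the Toy* types and toy_* functions are made up for
illustration, and the real GetCurrentLSNForWaitType() reads these
positions from shared memory rather than from a plain struct.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/* Hypothetical wait types, modelled on the standby modes discussed here. */
typedef enum
{
    TOY_WAIT_STANDBY_REPLAY,
    TOY_WAIT_STANDBY_WRITE,
    TOY_WAIT_STANDBY_FLUSH
} ToyWaitType;

/* Hypothetical snapshot of the three positions a standby tracks. */
typedef struct
{
    XLogRecPtr replayed;    /* last replayed end position */
    XLogRecPtr written;     /* walreceiver write position, 0 if none yet */
    XLogRecPtr flushed;     /* walreceiver flush position */
} ToyStandbyState;

static XLogRecPtr
toy_max(XLogRecPtr a, XLogRecPtr b)
{
    return (a > b) ? a : b;
}

/*
 * Return the position a waiter of the given type compares its target
 * against.  For the write and flush modes the replay position acts as a
 * floor: WAL that has been replayed must already be on disk, so a stale
 * walreceiver position can never report less than that.
 */
static XLogRecPtr
toy_current_lsn(const ToyStandbyState *state, ToyWaitType type)
{
    switch (type)
    {
        case TOY_WAIT_STANDBY_REPLAY:
            return state->replayed;
        case TOY_WAIT_STANDBY_WRITE:
            return toy_max(state->written, state->replayed);
        case TOY_WAIT_STANDBY_FLUSH:
            return toy_max(state->flushed, state->replayed);
    }
    assert(0 && "unknown wait type");
    return 0;
}
```

With written = 0 and replayed ahead of the target, a standby_write
waiter in this model sees the replay position and returns immediately,
while the walreceiver's own fields keep their current meaning.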

I also noticed another relevant problem:

During pure archive recovery (no walreceiver), a backend that issues
WAIT FOR LSN ... MODE 'standby_write' with a target ahead of the
current replay position will sleep forever: the startup process
replays past the target but only wakes 'STANDBY_REPLAY' waiters.

This also affects mixed scenarios: the walreceiver may lag behind
replay (e.g., archive restore has delivered WAL faster than
streaming), so a 'standby_write' waiter could be waiting on WAL that
replay has already consumed.
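
A toy model of the missed wakeup, to make the scenario concrete. The
Sim* types and sim_* functions are invented for this example and do not
correspond to the waitlsn code. The notify_write flag only shows one
conceivable direction for a fix and is an assumption on my part, not
necessarily what the eventual patch will do.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/* Hypothetical wait types; only the two needed for the scenario. */
typedef enum
{
    SIM_STANDBY_REPLAY,
    SIM_STANDBY_WRITE
} SimWaitType;

/* Hypothetical sleeping backend: what it waits for and whether it woke. */
typedef struct
{
    SimWaitType type;
    XLogRecPtr  target;
    bool        woken;
} SimWaiter;

/*
 * Wake every still-sleeping waiter of the given type whose target has
 * been reached.  Returns the number of waiters woken by this call.
 */
static int
sim_wakeup(SimWaiter *waiters, size_t n, SimWaitType type, XLogRecPtr lsn)
{
    int         woken = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!waiters[i].woken &&
            waiters[i].type == type &&
            waiters[i].target <= lsn)
        {
            waiters[i].woken = true;
            woken++;
        }
    }
    return woken;
}

/*
 * Model of the startup process advancing replay to 'lsn'.  With
 * notify_write = false only replay waiters are woken, which is the
 * behaviour described above.  With notify_write = true, write waiters
 * are woken as well (hypothetical).
 */
static int
sim_replay_advance(SimWaiter *waiters, size_t n, XLogRecPtr lsn,
                   bool notify_write)
{
    int         woken = sim_wakeup(waiters, n, SIM_STANDBY_REPLAY, lsn);

    if (notify_write)
        woken += sim_wakeup(waiters, n, SIM_STANDBY_WRITE, lsn);
    return woken;
}
```

With one replay waiter and one write waiter both targeting the same
LSN, advancing replay past it with notify_write = false wakes only the
replay waiter; the write waiter stays asleep even though its WAL has
already been replayed.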

I will write a patch to address this soon.

--
Best,
Xuneng
From b93a6e64cf2e9afcfa7ee4dee7c6dfab885a0585 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 7 Apr 2026 18:31:36 +0800
Subject: [PATCH v1 1/2] Remove redundant WAIT FOR LSN caller-side pre-checks

All five wakeup call sites duplicate WaitLSNWakeup()'s internal
fast-path minWaitedLSN check and add an unnecessary NULL check
on waitLSNState.

Remove the inline pre-checks and call WaitLSNWakeup() directly.
The fast-path check inside WaitLSNWakeup() already returns early
when no waiter's target has been reached, so there is no
performance difference.

The waitLSNState NULL checks are also unnecessary: shared memory
is fully initialized before any backend or auxiliary process
starts, so waitLSNState is always non-NULL at these call sites.
---
 src/backend/access/transam/xlog.c         | 16 ++++++----------
 src/backend/access/transam/xlogrecovery.c | 11 ++++-------
 src/backend/replication/walreceiver.c     | 17 ++++++-----------
 3 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 260fc801ce2..dd85bf52dc8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2936,12 +2936,10 @@ XLogFlush(XLogRecPtr record)
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
 	/*
-	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 * Wake up processes waiting for primary flush LSN to reach current flush
+	 * position.
 	 */
-	if (waitLSNState &&
-		(LogwrtResult.Flush >=
-		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
-		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+	WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
 
 	/*
 	 * If we still haven't flushed to the request point then we have a
@@ -3126,12 +3124,10 @@ XLogBackgroundFlush(void)
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
 	/*
-	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 * Wake up processes waiting for primary flush LSN to reach current flush
+	 * position.
 	 */
-	if (waitLSNState &&
-		(LogwrtResult.Flush >=
-		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
-		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+	WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
 
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c236e2b7969..4f2eaa36990 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1782,14 +1782,11 @@ PerformWalRecovery(void)
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
 			/*
-			 * If we replayed an LSN that someone was waiting for then walk
-			 * over the shared memory array and set latches to notify the
-			 * waiters.
+			 * Wake up processes waiting for standby replay LSN to reach
+			 * current replay position.
 			 */
-			if (waitLSNState &&
-				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY,
+						  XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a437273cf9a..c7dcb3003b5 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -980,12 +980,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
 	/*
-	 * If we wrote an LSN that someone was waiting for, notify the waiters.
+	 * Wake up processes waiting for standby write LSN to reach current write
+	 * position.
 	 */
-	if (waitLSNState &&
-		(LogstreamResult.Write >=
-		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
-		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
 
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
@@ -1027,13 +1025,10 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		SpinLockRelease(&walrcv->mutex);
 
 		/*
-		 * If we flushed an LSN that someone was waiting for, notify the
-		 * waiters.
+		 * Wake up processes waiting for standby flush LSN to reach current
+		 * flush position.
 		 */
-		if (waitLSNState &&
-			(LogstreamResult.Flush >=
-			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
-			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
 
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
-- 
2.51.0

Attachment: v1-0002-Use-replay-position-as-floor-for-WAIT-FOR-LSN-sta.patch
