Hi,

On Wed, Oct 15, 2025 at 4:43 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> On Wed, Oct 15, 2025 at 8:31 AM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi,
> >
> > On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > The following is the split patch set. There are certain limitations to
> > > this simplification effort, particularly in patch 2. The
> > > read_local_xlog_page_guts callback demands more functionality from the
> > > facility than the WAIT FOR patch — specifically, it must wait for WAL
> > > flush events, though it does not require timeout handling. In some
> > > sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
> > > patch, since it installs wake-up hooks in more locations. Unlike the
> > > WAIT FOR patch, which only needs wake-ups triggered by replay,
> > > read_local_xlog_page_guts must also handle wake-ups triggered by WAL
> > > flushes.
> > >
> > > Workload characteristics play a key role here. A sorted dlist performs
> > > well when insertions and removals occur in order, achieving O(1)
> > > complexity in the best case. In synchronous replication, insertion
> > > patterns seem generally monotonic with commit LSNs, though not
> > > strictly ordered due to timing variations and contention. When most
> > > insertions remain ordered, a dlist can be efficient. However, as the
> > > number of elements grows and out-of-order insertions become more
> > > frequent, the insertion cost degrades to O(n) more often.
> > >
> > > By contrast, a pairing heap maintains stable O(1) insertion for both
> > > ordered and disordered inputs, with amortized O(log n) removals. Since
> > > LSNs in the WAIT FOR command are likely to arrive in a non-sequential
> > > fashion, the pairing heap introduced in v6 provides more predictable
> > > performance under such workloads.
> > >
> > > At this stage (v7), no consolidation between syncrep and xlogwait has
> > > been implemented. This is mainly because the dlist and the pairing
> > > heap each work well under different workloads — neither is likely to
> > > be universally optimal. Introducing the facility with a pairing heap
> > > first seems reasonable, as it offers flexibility for future
> > > refactoring: we could later replace the dlist with a heap or adopt a
> > > modular design depending on observed workload characteristics.
> >
> > v8-0002 removed the early fast check before addLSNWaiter in
> > WaitForLSNReplay, as the likelihood of a server state change is small
> > compared to the branching cost and added code complexity.
>
> Made minor changes to the #include of xlogwait.h in patch 2 to calm
> CF-bots down.
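To make the pairing-heap reasoning above concrete, here is a minimal
sketch of an LSN-keyed waiter node built on src/include/lib/pairingheap.h.
The struct and function names here are illustrative only, not the ones the
patch set actually uses:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "lib/pairingheap.h"

typedef struct WaiterNode
{
    pairingheap_node ph_node;   /* embedded heap link */
    XLogRecPtr  waitLSN;        /* wake this waiter once its LSN is reached */
    int         procno;         /* backend whose latch to set */
} WaiterNode;

/*
 * pairingheap keeps the node the comparator ranks highest at the root, so
 * return 1 when a's LSN is the smaller one to get a min-heap on waitLSN
 * (the same trick snapmgr.c's xmin_cmp uses).
 */
static int
waiter_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
    const WaiterNode *wa = pairingheap_const_container(WaiterNode, ph_node, a);
    const WaiterNode *wb = pairingheap_const_container(WaiterNode, ph_node, b);

    if (wa->waitLSN < wb->waitLSN)
        return 1;
    if (wa->waitLSN > wb->waitLSN)
        return -1;
    return 0;
}

static void
waiter_example(void)
{
    pairingheap *heap = pairingheap_allocate(waiter_cmp, NULL);
    WaiterNode  w = {.waitLSN = (XLogRecPtr) 0x1000000, .procno = 42};

    /* O(1) whether or not LSNs arrive in commit order */
    pairingheap_add(heap, &w.ph_node);

    /* the smallest waited-for LSN is always at the root */
    if (!pairingheap_is_empty(heap))
    {
        WaiterNode *min = pairingheap_container(WaiterNode, ph_node,
                                                pairingheap_first(heap));

        (void) min;
    }
}

pairingheap_add() stays O(1) regardless of arrival order, which is the
property the quoted paragraph relies on; the amortized O(log n) cost is
paid only when the minimum is removed.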
Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them. Please check.

Best,
Xuneng
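For anyone skimming the diff below, the waiter/wakeup contract the patch
relies on is roughly the following. This is a sketch inferred from the
call sites in the patch, not quoted from xlogwait.h:

#include "postgres.h"
#include "access/xlog.h"        /* RecoveryInProgress() */
#include "access/xlogwait.h"    /* WaitForLSN(), WaitLSNWakeup(), waitLSNState */
#include "port/atomics.h"

/*
 * Waiter side, as in read_local_xlog_page_guts below: block until 'loc'
 * is flushed (primary) or replayed (standby); -1 means no timeout.
 */
static void
wait_for_wal(XLogRecPtr loc)
{
    if (!RecoveryInProgress())
        (void) WaitForLSN(WAIT_LSN_TYPE_FLUSH, loc, -1);
    else if (WaitForLSN(WAIT_LSN_TYPE_REPLAY, loc, -1) ==
             WAIT_LSN_RESULT_NOT_IN_RECOVERY)
    {
        /*
         * Promoted while waiting: the real caller loops around and takes
         * the flush branch above on the next iteration.
         */
    }
}

/*
 * Wakeup side, as added to XLogFlush()/XLogBackgroundFlush(): an unlocked
 * atomic read of the minimal waited LSN keeps the common no-waiters path
 * cheap; the waiter array is walked only when someone could be satisfied.
 */
static void
wakeup_flush_waiters(XLogRecPtr flushed)
{
    if (waitLSNState &&
        flushed >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH]))
        WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, flushed);
}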
From ec9625408c81ac78e4d7ecf6a459e258d56482fb Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Fri, 7 Nov 2025 21:33:05 +0800
Subject: [PATCH v10] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting.

Replace the inefficient polling loop in read_local_xlog_page_guts with
the facilities developed in the xlogwait module when WAL data is not
yet available. This eliminates CPU-intensive busy waiting and improves
responsiveness by waking processes immediately when their target LSN
becomes available.
---
 src/backend/access/transam/xlog.c      | 20 +++++++++++
 src/backend/access/transam/xlogutils.c | 46 ++++++++++++++++++++++----
 src/backend/replication/walsender.c    |  4 ---
 3 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 101b616b028..f86f8b9d8cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2915,6 +2915,16 @@ XLogFlush(XLogRecPtr record)
     /* wake up walsenders now that we've released heavily contended locks */
     WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+    /*
+     * If we flushed an LSN that someone was waiting for then walk
+     * over the shared memory array and set latches to notify the
+     * waiters.
+     */
+    if (waitLSNState &&
+        (LogwrtResult.Flush >=
+         pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH])))
+        WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush);
+
     /*
      * If we still haven't flushed to the request point then we have a
      * problem; most likely, the requested flush point is past end of XLOG.
@@ -3097,6 +3107,16 @@ XLogBackgroundFlush(void)
     /* wake up walsenders now that we've released heavily contended locks */
     WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+    /*
+     * If we flushed an LSN that someone was waiting for then walk
+     * over the shared memory array and set latches to notify the
+     * waiters.
+     */
+    if (waitLSNState &&
+        (LogwrtResult.Flush >=
+         pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH])))
+        WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush);
+
     /*
      * Great, done. To take some work off the critical path, try to initialize
      * as many of the no-longer-needed WAL buffers for future use as we can.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index ce2a3e42146..39a18744cae 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
     loc = targetPagePtr + reqLen;
 
     /*
-     * Loop waiting for xlog to be available if necessary
-     *
-     * TODO: The walsender has its own version of this function, which uses a
-     * condition variable to wake up whenever WAL is flushed. We could use the
-     * same infrastructure here, instead of the check/sleep/repeat style of
-     * loop.
+     * Wait for xlog to be available if necessary.
      */
     while (1)
     {
@@ -947,7 +943,43 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
         }
 
         CHECK_FOR_INTERRUPTS();
-        pg_usleep(1000L);
+
+        /*
+         * Wait for the LSN using the appropriate method based on server state.
+         */
+        if (!RecoveryInProgress())
+        {
+            /* Primary: wait for flush */
+            WaitForLSN(WAIT_LSN_TYPE_FLUSH, loc, -1);
+        }
+        else
+        {
+            /* Standby: wait for replay */
+            WaitLSNResult result = WaitForLSN(WAIT_LSN_TYPE_REPLAY, loc, -1);
+
+            switch (result)
+            {
+                case WAIT_LSN_RESULT_SUCCESS:
+                    /* LSN was replayed, loop back to recheck timeline */
+                    break;
+
+                case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+                    /*
+                     * Promoted while waiting. This is the tricky case.
+                     * We're now a primary, so loop back and use flush
+                     * logic instead of replay logic.
+                     */
+                    break;
+
+                default:
+                    elog(ERROR, "unexpected wait result");
+            }
+        }
+
+        /*
+         * Loop back to recheck everything. The timeline might have
+         * changed during our wait.
+         */
     }
     else
     {
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fc8f8559073..1821bf31539 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1036,10 +1036,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-- 
2.51.0
