Hi,

On Wed, Oct 15, 2025 at 4:43 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> On Wed, Oct 15, 2025 at 8:31 AM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi,
> >
> > On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > The following is the split patch set. There are certain limitations to
> > > this simplification effort, particularly in patch 2. The
> > > read_local_xlog_page_guts callback demands more functionality from the
> > > facility than the WAIT FOR patch — specifically, it must wait for WAL
> > > flush events, though it does not require timeout handling. In some
> > > sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
> > > patch, since it installs wake-up hooks in more locations. Unlike the
> > > WAIT FOR patch, which only needs wake-ups triggered by replay,
> > > read_local_xlog_page_guts must also handle wake-ups triggered by WAL
> > > flushes.
> > >
> > > Workload characteristics play a key role here. A sorted dlist performs
> > > well when insertions and removals occur in order, achieving O(1)
> > > complexity in the best case. In synchronous replication, insertion
> > > patterns seem generally monotonic with commit LSNs, though not
> > > strictly ordered due to timing variations and contention. When most
> > > insertions remain ordered, a dlist can be efficient. However, as the
> > > number of elements grows and out-of-order insertions become more
> > > frequent, the insertion cost degrades to O(n) more often.
> > >
> > > By contrast, a pairing heap maintains stable O(1) insertion for both
> > > ordered and disordered inputs, with amortized O(log n) removals. Since
> > > LSNs in the WAIT FOR command are likely to arrive in a non-sequential
> > > fashion, the pairing heap introduced in v6 provides more predictable
> > > performance under such workloads.
> > >
> > > At this stage (v7), no consolidation between syncrep and xlogwait has
> > > been implemented. This is mainly because the dlist and the pairing
> > > heap each work well under different workloads — neither is likely to
> > > be universally optimal. Introducing the facility with a pairing heap
> > > first seems reasonable, as it offers flexibility for future
> > > refactoring: we could later replace the dlist with a heap or adopt a
> > > modular design depending on observed workload characteristics.
> >
> > v8-0002 removed the early fast check before addLSNWaiter in
> > WaitForLSNReplay, as the likelihood of a server state change is small
> > compared to the branching cost and added code complexity.
>
> Made minor changes to the #include of xlogwait.h in patch 2 to calm
> CF-bots down.
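To make the pairing-heap reasoning above concrete, here is a minimal
sketch of an LSN-keyed waiter node built on src/include/lib/pairingheap.h.
The struct and function names here are illustrative only, not the ones the
patch set actually uses:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "lib/pairingheap.h"

typedef struct WaiterNode
{
    pairingheap_node ph_node;   /* embedded heap link */
    XLogRecPtr  waitLSN;        /* wake this waiter once its LSN is reached */
    int         procno;         /* backend whose latch to set */
} WaiterNode;

/*
 * pairingheap keeps the node the comparator ranks highest at the root, so
 * return 1 when a's LSN is the smaller one to get a min-heap on waitLSN
 * (the same trick snapmgr.c's xmin_cmp uses).
 */
static int
waiter_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
    const WaiterNode *wa = pairingheap_const_container(WaiterNode, ph_node, a);
    const WaiterNode *wb = pairingheap_const_container(WaiterNode, ph_node, b);

    if (wa->waitLSN < wb->waitLSN)
        return 1;
    if (wa->waitLSN > wb->waitLSN)
        return -1;
    return 0;
}

static void
waiter_example(void)
{
    pairingheap *heap = pairingheap_allocate(waiter_cmp, NULL);
    WaiterNode  w = {.waitLSN = (XLogRecPtr) 0x1000000, .procno = 42};

    /* O(1) whether or not LSNs arrive in commit order */
    pairingheap_add(heap, &w.ph_node);

    /* the smallest waited-for LSN is always at the root */
    if (!pairingheap_is_empty(heap))
    {
        WaiterNode *min = pairingheap_container(WaiterNode, ph_node,
                                                pairingheap_first(heap));

        (void) min;
    }
}

pairingheap_add() stays O(1) regardless of arrival order, which is the
property the quoted paragraph relies on; the amortized O(log n) cost is
paid only when the minimum is removed.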
Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them. Please check.

Best,
Xuneng
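For anyone skimming the diff below, the waiter/wakeup contract the patch
relies on is roughly the following. This is a sketch inferred from the
call sites in the patch, not quoted from xlogwait.h:

#include "postgres.h"
#include "access/xlog.h"        /* RecoveryInProgress() */
#include "access/xlogwait.h"    /* WaitForLSN(), WaitLSNWakeup(), waitLSNState */
#include "port/atomics.h"

/*
 * Waiter side, as in read_local_xlog_page_guts below: block until 'loc'
 * is flushed (primary) or replayed (standby); -1 means no timeout.
 */
static void
wait_for_wal(XLogRecPtr loc)
{
    if (!RecoveryInProgress())
        (void) WaitForLSN(WAIT_LSN_TYPE_FLUSH, loc, -1);
    else if (WaitForLSN(WAIT_LSN_TYPE_REPLAY, loc, -1) ==
             WAIT_LSN_RESULT_NOT_IN_RECOVERY)
    {
        /*
         * Promoted while waiting: the real caller loops around and takes
         * the flush branch above on the next iteration.
         */
    }
}

/*
 * Wakeup side, as added to XLogFlush()/XLogBackgroundFlush(): an unlocked
 * atomic read of the minimal waited LSN keeps the common no-waiters path
 * cheap; the waiter array is walked only when someone could be satisfied.
 */
static void
wakeup_flush_waiters(XLogRecPtr flushed)
{
    if (waitLSNState &&
        flushed >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH]))
        WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, flushed);
}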
From ec9625408c81ac78e4d7ecf6a459e258d56482fb Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Fri, 7 Nov 2025 21:33:05 +0800
Subject: [PATCH v10] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting.

Replace the inefficient polling loop in read_local_xlog_page_guts with
the facilities developed in the xlogwait module when WAL data is not
yet available. This eliminates CPU-intensive busy waiting and improves
responsiveness by waking processes immediately when their target LSN
becomes available.
---
 src/backend/access/transam/xlog.c      | 20 +++++++++++
 src/backend/access/transam/xlogutils.c | 46 ++++++++++++++++++++++----
 src/backend/replication/walsender.c    |  4 ---
 3 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 101b616b028..f86f8b9d8cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2915,6 +2915,16 @@ XLogFlush(XLogRecPtr record)
     /* wake up walsenders now that we've released heavily contended locks */
     WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+    /*
+     * If we flushed an LSN that someone was waiting for then walk
+     * over the shared memory array and set latches to notify the
+     * waiters.
+     */
+    if (waitLSNState &&
+        (LogwrtResult.Flush >=
+         pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH])))
+        WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush);
+
     /*
      * If we still haven't flushed to the request point then we have a
      * problem; most likely, the requested flush point is past end of XLOG.
@@ -3097,6 +3107,16 @@ XLogBackgroundFlush(void)
     /* wake up walsenders now that we've released heavily contended locks */
     WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+    /*
+     * If we flushed an LSN that someone was waiting for then walk
+     * over the shared memory array and set latches to notify the
+     * waiters.
+     */
+    if (waitLSNState &&
+        (LogwrtResult.Flush >=
+         pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH])))
+        WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush);
+
     /*
      * Great, done. To take some work off the critical path, try to initialize
      * as many of the no-longer-needed WAL buffers for future use as we can.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index ce2a3e42146..39a18744cae 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
     loc = targetPagePtr + reqLen;
 
     /*
-     * Loop waiting for xlog to be available if necessary
-     *
-     * TODO: The walsender has its own version of this function, which uses a
-     * condition variable to wake up whenever WAL is flushed. We could use the
-     * same infrastructure here, instead of the check/sleep/repeat style of
-     * loop.
+     * Wait for xlog to be available if necessary.
      */
     while (1)
     {
@@ -947,7 +943,43 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
         }
 
         CHECK_FOR_INTERRUPTS();
-        pg_usleep(1000L);
+
+        /*
+         * Wait for the LSN using the appropriate method based on server state.
+         */
+        if (!RecoveryInProgress())
+        {
+            /* Primary: wait for flush */
+            WaitForLSN(WAIT_LSN_TYPE_FLUSH, loc, -1);
+        }
+        else
+        {
+            /* Standby: wait for replay */
+            WaitLSNResult result = WaitForLSN(WAIT_LSN_TYPE_REPLAY, loc, -1);
+
+            switch (result)
+            {
+                case WAIT_LSN_RESULT_SUCCESS:
+                    /* LSN was replayed, loop back to recheck timeline */
+                    break;
+
+                case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+                    /*
+                     * Promoted while waiting. This is the tricky case.
+                     * We're now a primary, so loop back and use flush
+                     * logic instead of replay logic.
+                     */
+                    break;
+
+                default:
+                    elog(ERROR, "unexpected wait result");
+            }
+        }
+
+        /*
+         * Loop back to recheck everything. The timeline might have
+         * changed during our wait.
+         */
     }
     else
     {
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fc8f8559073..1821bf31539 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1036,10 +1036,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-- 
2.51.0
