Re: Separate catalog_xmin from xmin in walsender hot standby feedback

Rui Zhao Sun, 14 Jun 2026 19:49:14 -0700

Hi Amit,

Thanks for the careful review. Following your suggestion to explore the
ephemeral-slot direction hinted at in the XXX comment in walsender.c
(ProcessStandbyHSFeedbackMessage), I built a working prototype and ran it
through the design questions you raised. Sharing what I found, since it
changes my recommendation.


== The prototype ==

When a standby connects WITHOUT a replication slot and sends hot standby
feedback, the walsender now lazily creates an *ephemeral* physical slot
(RS_EPHEMERAL, named pg_walsender_<pid>) on first feedback, and tracks the
two horizons through it:

if (MyReplicationSlot == NULL)
    ReplicationSlotCreate("pg_walsender_<pid>", false, RS_EPHEMERAL,
                          false, false, false, false,
                          /* error_if_full */ false);

if (MyReplicationSlot != NULL)
    PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
else
    /* graceful degradation -- see below */

This directly addresses both of your concerns:

(a) Per-PGPROC memory: nothing is added to PGPROC. The 4 bytes/proc that
  the previous patch spent on every backend are gone; only walsenders
  that actually receive feedback consume anything, and they consume a
  slot they were arguably entitled to all along.

(b) Torn read between xmin and catalog_xmin: PhysicalReplicationSlotNewXmin
  updates effective_xmin and effective_catalog_xmin together under
  slot->mutex, so ComputeXidHorizons never observes a half-updated pair.
  The atomicity falls out of the existing slot machinery for free.

== WAL retention ==

The obvious worry with a slot is restart_lsn pinning WAL. In practice this
is bounded and, I'd argue, desirable:

- The slot is ephemeral, so it is dropped automatically when the walsender
exits (before_shmem_exit -> ReplicationSlotShmemExit ->
ReplicationSlotRelease), including on standby disconnect/crash and on
crash-restart. There is no persistent slot left behind to pile up WAL
indefinitely -- which was the historical reason this path avoided slots.

- While the standby is connected, restart_lsn keeps the WAL the standby
still needs. That is precisely the behavior we want: today the no-slot
path can let the primary recycle WAL out from under a perfectly healthy
standby. Pinning it is a feature, not a regression.

- The worst case (connected but hopelessly lagging) is bounded by
max_slot_wal_keep_size, and a dead-but-unnoticed standby is bounded by
wal_sender_timeout (default 60s) before the walsender exits and the slot
drops.

== Slot-pool exhaustion ==

The one real pitfall is that ephemeral slots consume max_replication_slots.
If the pool is exhausted, ReplicationSlotCreate would ereport(ERROR), which
in the walsender would tear down the replication connection -- a regression
for feedback-only standbys that work fine today.

I solved this by adding an error_if_full flag to ReplicationSlotCreate
(default-equivalent true for all existing callers; behavior unchanged). When
false, a full pool releases ReplicationSlotAllocationLock and returns false
instead of erroring, and the walsender degrades gracefully to the pre-existing
behavior -- holding back the older of the two horizons via its own PGPROC
xmin:

/* slot pool exhausted: never break the connection */
MyProc->xmin = min(feedbackXmin, feedbackCatalogXmin);

So the separation is best-effort: you get the catalog/data split whenever a
slot is available, and you fall back to exactly today's behavior when it is
not. No connection is ever broken by this change.

== Visibility ==

One user-facing consequence worth flagging: each feedback standby now surfaces
a pg_walsender_<pid> physical slot in pg_replication_slots while it is
connected (it disappears when the walsender exits and the ephemeral slot is
dropped). Note the temporary column reads 'f' there, since the slot is
RS_EPHEMERAL rather than RS_TEMPORARY, so to a casual reader it looks like an
ordinary persistent physical slot distinguished only by its name prefix.

This is harmless functionally, but monitoring that counts slots or alerts on
"unexpected" slots will see these appear, and the 'f' in temporary may
mislead someone into thinking they won't be cleaned up automatically (they
are). The PGPROC approach has no such surface -- the horizon is held silently
on the walsender's proc entry. So this is a genuine trade-off between the two:
visibility/observability (you can actually see what each standby is pinning)
versus invisibility (nothing new shows up anywhere). I think surfacing it is
mostly a feature, but I wanted to call it out explicitly rather than have it
surprise anyone.

== Validation ==

I tested this on a live primary/standby pair (rebased onto current master).
A standby connects with no slot and hot_standby_feedback = on, and a logical
slot on the standby drives a catalog_xmin distinct from the data xmin. On the
primary, pg_replication_slots then shows:

slot_name    | pg_walsender_3908747
slot_type    | physical
xmin         | 696   <- data horizon
catalog_xmin | 695   <- catalog horizon

So the two horizons are tracked separately, exactly as intended. Before this
change the slotless path would have collapsed them into MyProc->xmin =
min(696, 695) = 695, needlessly holding the data-table vacuum horizon back by
one. With the patch the data horizon stays at 696 and only catalog vacuum is
held at 695.

I also confirmed the lifecycle: the ephemeral slot is created lazily on first
feedback, and it is dropped automatically when the standby disconnects (the
slot count on the primary returns to zero), so there is no WAL pile-up left
behind.

== Recommendation ==

Given the above, I lean toward the ephemeral-slot approach over the PGPROC
approach: it removes the per-proc cost, gets atomicity for free, and reuses
existing, well-understood slot machinery instead of introducing a parallel
horizon-tracking path. The new surface is small: the error_if_full flag, plus
the visibility trade-off noted above (a pg_walsender_<pid> slot per connected
feedback standby, versus the PGPROC approach holding the horizon silently).

The attached prototype compiles cleanly and passes the live test above on
current master. Before I post a formal patch I wanted your read on the
direction -- in particular whether consuming (and exposing) a
max_replication_slots entry per feedback standby is acceptable given the
graceful fallback, or whether you'd prefer the pool accounting handled
differently.

Thanks again,
Rui

v1-0001-Separate-catalog_xmin-from-xmin-via-ephemeral-phy.patch
Description: Binary data


> 2026年6月8日 19:29，Amit Kapila <[email protected]> 写道：
> 
> On Mon, Jun 8, 2026 at 1:27 PM Rui Zhao <[email protected]> wrote:
>> 
>> Attaching v2, rebased on top of current master (v1 no longer applied
>> cleanly per cfbot).  Two improvements were also made since v1:
>> 
>> 1. Test stability
>>   The TAP test (053_hs_feedback_catalog_xmin.pl) replaced a sleep(2)
>>   with a poll on pg_stat_replication.reply_time, so it deterministically
>>   waits for a fresh hot standby feedback round to reach the primary
>>   instead of relying on wall-clock time.  This removes the most obvious
>>   source of flakiness on slower CI runners.
>> 
>> 2. Performance impact (pgbench)
>>   I was concerned about adding a UINT32_ACCESS_ONCE(proc->catalog_xmin)
>>   read and an extra TransactionIdOlder() call per PGPROC iteration in
>>   ComputeXidHorizons(), which is on a hot path.  Measured on a 104-core
>>   box, scale=30, 64 clients / 32 jobs / 60s, -M prepared, optimized
>>   build (CFLAGS=-O2, no --enable-cassert):
>> 
>>   workload                 baseline median   patched median   delta
>>   ----------------------   ---------------   --------------   -----
>>   pgbench -S (read-only)        1,596,196        1,617,083    +1.3%
>>   pgbench    (read-write)         129,550          132,024    +1.9%
>> 
>>   Numbers are medians of 3 warm runs after a 30s warmup, after I
>>   discarded the cold-start runs which were dominated by cache priming.
>>   The patched build comes out marginally above baseline, which is of
>>   course noise -- adding code does not make things faster -- but it
>>   confirms the change is below the measurement floor on this hardware.
>>   Happy to rerun on different shapes (more clients, smaller scale,
>>   different machine) if anyone wants to see specific numbers.
>> 
>> Summary of the change (unchanged in substance from v1):
>> 
>>  - proc.h: add catalog_xmin field to PGPROC (4 bytes)
>>  - proc.c: initialize it in InitProcess / InitAuxiliaryProcess
>>  - procarray.c: accumulate proc_catalog_xmin in ComputeXidHorizons()
>>    and apply it only to catalog_oldest_nonremovable and
>>    shared_oldest_nonremovable; include it in GetReplicationHorizons()
>>    so the catalog_xmin propagates correctly in cascading setups
>>  - walsender.c: in the no-slot path of ProcessStandbyHSFeedbackMessage(),
>>    set MyProc->xmin and MyProc->catalog_xmin separately instead of
>>    folding them together
>> 
> 
> I think storing catalog_xmin separately in PGPROC has few downsides
> which are (a) it always needs additional four bytes in PGPROC which is
> used for backend and other processes even though it is required for
> walsender, (b) as both xmin and catalog_xmin are written separately
> ComputeXidHorizons() could read one value as updated and other stale.
> There may be something more fundamental which I may be missing but I
> feel ephermal slots idea as hinted in comments is worth exploring.
> 
> -- 
> With Regards,
> Amit Kapila.

Re: Separate catalog_xmin from xmin in walsender hot standby feedback

Reply via email to