Hi hackers, When a long-running write transaction coexists with many concurrent catalog-modifying commits (e.g., autovacuum updating pg_class stats for many tables, or batch DDL), the reorder buffer's spill files for logical decoding can grow quadratically, reaching ~80GB in our production with N=100K extrapolation. This patch eliminates the problem by deferring snapshot distribution from commit-time to data-change-time.
== Problem ==
The trigger requires two conditions:
(i) Something prevents xmin from advancing. This can be the long write
transaction itself (its own XID holds xmin back), or an unrelated
xmin-holder like pg_dump's REPEATABLE READ session or a slow
logical-replication consumer.
(ii) A transaction that has at least one data change is present in the
reorder buffer. Pure read-only sessions (pg_dump, long SELECT) do
not appear in the reorder buffer and are not victims themselves;
they only enable condition (i).
When both are met, the chain reaction runs across four stages:
1) VACUUM updates pg_class statistics via heap_inplace_update_and_unlock(),
which triggers CacheInvalidateHeapTuple(). At command end,
LogLogicalInvalidations() emits an XLOG_XACT_INVALIDATIONS WAL record.
2) During logical decoding, xact_decode() processes XLOG_XACT_INVALIDATIONS
by calling ReorderBufferXidSetCatalogChanges(), marking the VACUUM
transaction as having catalog changes.
3) When the VACUUM transaction commits, SnapBuildCommitTxn() sees it has
catalog changes, builds a new snapshot via SnapBuildBuildSnapshot(), and
calls SnapBuildDistributeSnapshotAndInval() to distribute the snapshot
to ALL in-progress transactions.
4) The snapshot includes the entire builder->committed.xip array. Since a
long-running transaction prevents SnapBuildPurgeOlderTxn() from cleaning
old entries (builder->xmin cannot advance), the committed array grows
monotonically. Each successive snapshot is larger than the last:
Snapshot N contains N xids, size = 192 + 4*N bytes
Total disk usage = sum(i=1..N)(192 + 4*i) = 2*N^2 bytes
In our measurements (see Performance impact below), master spills
247 MB at N=5000 and 813 MB at N=10000, scaling cleanly as O(N^2).
Extrapolation to N=100K yields ~80 GB.
Beyond disk usage, these accumulated snapshots also degrade decoding
performance:
a) Memory pressure: the snapshots consume logical_decoding_work_mem
quickly, forcing the reorder buffer to spill to disk much earlier
than necessary.
b) Spill/restore I/O: each INTERNAL_SNAPSHOT change is serialized
during spill and deserialized during commit-time processing,
adding significant disk I/O for what is essentially dead data.
c) Commit-time iteration overhead: ReorderBufferProcessTXN() iterates
all changes via a binary heap. 100K extra INTERNAL_SNAPSHOT entries
mean 100K additional heap pop operations and snapshot_now switches
(refcount inc/dec, potential old snapshot free).
d) Replication lag: the combined CPU and I/O overhead directly
increases the time between a transaction's commit and its decoded
output reaching subscribers.
== Why not filter VACUUM? ==
One might consider filtering out VACUUM-generated XLOG_XACT_INVALIDATIONS
so they don't trigger snapshot rebuilds. However, the O(N^2) problem is
not specific to VACUUM. Any workload that generates many catalog-modifying
commits triggers it: batch partition creation, online migration tools
running thousands of DDLs, extension installations creating many functions
and types, etc. A general solution at the distribution layer is needed.
== Prior work ==
The distribution function modified by this patch was last touched by the
"long-standing data loss bug in initial sync of logical replication"
thread started by Tomas Vondra in 2023 [1], committed by Amit Kapila in
2025. That fix renamed
SnapBuildDistributeNewCatalogSnapshot
to SnapBuildDistributeSnapshotAndInval and added invalidation
distribution alongside the existing snapshot distribution, to ensure
that in-progress transactions invalidate their relcache when concurrent
publication DDL changes the set of published tables. The commit
message acknowledged "some performance regression ... primarily during
frequent execution of publication DDL statements". The present patch
addresses the snapshot-distribution side of that overhead while keeping
the invalidation-distribution semantics unchanged, since invalidations
cannot be deduplicated the way snapshots can (see "Correctness" below).
Related but orthogonal work in the same area:
- Sawada's "Using per-transaction memory contexts for storing decoded
tuples" [2] addresses a different root cause of decoding memory
bloat under long-running transactions: GenerationContext
fragmentation in the reorder buffer's tuple context. His patch is
complementary; both contribute to keeping decoding memory usage
within logical_decoding_work_mem.
- Tachoires's "Compress ReorderBuffer spill files using LZ4" [3]
reduces the disk cost of spilled changes via compression. The two
approaches are complementary: this patch reduces what needs to be
spilled in the first place (eliminating O(N^2) INTERNAL_SNAPSHOT
entries for write transactions with few data changes), while [3]
reduces the cost of whatever does spill.
[1]
https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
[2]
https://www.postgresql.org/message-id/flat/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/flat/CAA4eK1Kd6vP1g5pJXoiPgfLQsn54hecsc06hxpNWV-9_xMa%2B5Q%40mail.gmail.com
== Solution: Lazy Snapshot Distribution ==
Instead of distributing a snapshot to every in-progress transaction on each
catalog-modifying commit, we defer distribution until a transaction actually
needs to decode a data change (in SnapBuildProcessChange()).
A generation counter (SnapBuild.snapshot_generation) is incremented each
time a new catalog snapshot is built. Each transaction tracks the last
generation it received (ReorderBufferTXN.last_snapshot_generation). When
SnapBuildProcessChange() is called for a data change, it checks whether the
transaction's generation is behind the builder's and distributes the current
snapshot only if needed.
This means:
- A long-running write transaction with K data changes accumulates at
most min(K, G) snapshots in its change list (where G is the number of
catalog-modifying commits during its lifetime), instead of G snapshots
each of growing size. The generation counter deduplicates further:
consecutive data changes without intervening DDLs share one snapshot.
- Total disk usage attributable to snapshot distribution drops from
O(N^2) to O(min(K, G) * N), where N is the committed xid array size.
For the common production case where K << G (e.g. a long batch INSERT
during a vacuum storm), the savings are substantial.
Invalidation messages are still distributed eagerly as before, since they
are small (100K vacuum commits = 14MB) and already have overflow protection
via MAX_DISTR_INVAL_MSG_PER_TXN.
== Correctness ==
The lazy approach is equivalent to eager distribution because:
1) WAL is processed strictly in LSN order. When SnapBuildProcessChange()
is called at LSN X, builder->snapshot reflects exactly all catalog-
modifying commits with LSN < X. It is neither "too new" (future commits
haven't been processed) nor "too old" (all past commits are included).
2) The committed array is append-only (only grows, never shrinks while a
long transaction holds xmin back). Snapshot N is a superset of snapshots
1..N-1. Skipping intermediate snapshots does not affect visibility.
3) In the original code, when a transaction's changes are decoded at commit
time, each data change uses the most recent snapshot preceding it in the
change list. Intermediate snapshots that have no data changes between
them are effectively unused. Lazy distribution simply avoids creating
these unused entries.
4) This works correctly with streaming mode (changes decoded before commit)
because SnapBuildProcessChange() is always called before each data change
regardless of whether the change is later streamed or buffered.
5) The lazy snapshot is distributed to the toplevel transaction (using
txn->xid), consistent with the original eager distribution. When a
data change belongs to a subtransaction, the snapshot and the data
change reside in different change lists (toplevel vs subtransaction).
This is safe because eager invalidation distribution always places an
invalidation change entry in the toplevel's list at the DDL commit
LSN (which is strictly less than the data change LSN). During binary
heap iteration, this invalidation entry ensures the toplevel is at
the heap root when the equal-LSN comparison between the snapshot and
the data change occurs, so the snapshot is always processed first.
== Why not eliminate ReorderBufferAddSnapshot entirely? ==
One might ask: if we can defer snapshot distribution, why not skip storing
snapshots in the reorder buffer altogether and just use builder->snapshot
directly at decode time?
This doesn't work because DecodeInsert/DecodeUpdate only package the raw
WAL tuple into a ReorderBufferChange and enqueue it. The actual catalog
lookup (RelationIdGetRelation, tuple interpretation) happens later during
commit-time processing in ReorderBufferProcessTXN. At that point,
builder->snapshot has advanced past all changes and no longer reflects the
catalog state at each individual change's LSN.
The INTERNAL_SNAPSHOT entries in the change list serve as markers that tell
commit-time processing "switch to this catalog snapshot from here on."
Without them, all changes would use the base_snapshot, and an INSERT that
happened before an ALTER TABLE ADD COLUMN would be decoded with the new
schema, producing incorrect output (extra columns with NULL values).
Since each INTERNAL_SNAPSHOT entry serves all subsequent data changes until
the next generation bump, the total number of entries is min(K, G), which
is already minimal.
== Notes ==
- RBTXN_HAS_CATALOG_CHANGES is still needed: it is used by
SnapBuildCommitTxn() to decide whether to rebuild the snapshot,
ReorderBufferBuildTupleCidHash() for combo CID handling, and
SnapBuildSerialize() for serialization. This patch does not change
its semantics.
- SNAPBUILD_VERSION is bumped from 6 to 7 because snapshot_generation
is added to the SnapBuild struct which is serialized to disk. On
upgrade, existing serialized snapshots will be invalidated and rebuilt
automatically, which is the standard behavior for version mismatches.
== Changes ==
src/include/replication/snapbuild_internal.h:
- Add uint64 snapshot_generation to struct SnapBuild
src/include/replication/reorderbuffer.h:
- Add uint64 last_snapshot_generation to ReorderBufferTXN
- Declare ReorderBufferTXNByXid() as extern
src/backend/replication/logical/snapbuild.c:
- SnapBuildCommitTxn(): increment snapshot_generation after building
a new snapshot
- Rename SnapBuildDistributeSnapshotAndInval() to SnapBuildDistributeInval()
and remove snapshot distribution code; keep invalidation distribution
- SnapBuildProcessChange(): do a single ReorderBufferTXNByXid() lookup,
reuse the txn pointer for both base snapshot check and lazy catalog
snapshot distribution. The lazy snapshot is distributed to the
toplevel transaction, consistent with the original eager approach.
This also eliminates redundant hash lookups that existed in the
original code path.
- Bump SNAPBUILD_VERSION from 6 to 7
src/backend/replication/logical/reorderbuffer.c:
- Remove 'static' from ReorderBufferTXNByXid()
== Performance impact ==
Theoretical:
Before After
Snapshot distributions G (per commit) 0 or 1 (per data change)
Disk usage growth O(N^2) O(min(K, G) * N)
where N = committed xid array size at the time of the K-th data change,
G = number of catalog-modifying commits during the long txn,
K = number of data changes in the long txn.
The hot path (SnapBuildProcessChange for each data change) now does a single
hash lookup instead of 2-4 separate lookups in the original code, so there
is no performance regression for the common case.
Measured (single-threaded benchmark, scripts attached):
Scenario: one long-running transaction with K=1 INSERT, coexisting with
N concurrent CREATE/DROP TABLE pairs (each its own catalog-modifying
commit), then drained via pg_logical_slot_get_changes() with the
test_decoding output plugin. 3-run median per cell.
logical_decoding_work_mem = 64MB (production-like default):
N decode_ms spill_bytes total_bytes
---- ------------- ------------- -------------
1000 master 45 master 0 master 22 MB
patch 39 patch 0 patch 13 MB
5000 master 629 master 207 MB master 429 MB
patch 261 patch 0 patch 227 MB
10000 master 1865 master 813 MB master 1659 MB
patch 814 patch 0 patch 855 MB
At ldwm=64MB the patched code does not spill at all up to N=10000;
master spills 207 MB at N=5000 and 813 MB at N=10000, scaling
quadratically in N. Decoding time drops 2.3-2.4x at N>=5000,
purely from avoided spill I/O. Memory residency (total_bytes)
is halved as well, because INTERNAL_SNAPSHOT entries no longer
accumulate in the long transaction's change list.
logical_decoding_work_mem = 64kB (stress, forces spill even at small N):
N decode_ms spill_bytes total_bytes
---- ------------- ------------- -------------
500 master 28 master 2.6 MB master 6.9 MB
patch 24 patch 0.4 MB patch 4.7 MB
1000 master 57 master 9.3 MB master 22 MB
patch 42 patch 0.9 MB patch 13 MB
2000 master 143 master 35 MB master 76 MB
patch 82 patch 1.8 MB patch 43 MB
5000 master 607 master 247 MB master 429 MB
patch 289 patch 25 MB patch 227 MB
Under stress, the patch reduces spill bytes by 6-19x and decoding
time by 1.2-2.1x. Master's spill grows as ~10*N^2 bytes (clean
O(N^2) signal: 5000^2 / 1000^2 = 25, spill 247 MB / 9.3 MB = 26.6).
The patch's residual spill is dominated by invalidation-message
entries (still distributed eagerly by design); these are bounded
by MAX_DISTR_INVAL_MSG_PER_TXN.
The decoded change count is identical between master and patch in
every cell (3 changes: BEGIN + INSERT + COMMIT of the long
transaction; DDL transactions emit no test_decoding output), so
the patch preserves decoding output exactly.
Extrapolating to N=100K using the observed quadratic coefficient on
master and linear on patch:
N=100K master ~80 GB spill (vs the 2*N^2 lower-bound
estimate of 20 GB; the observed
coefficient is ~4x higher due to
per-entry ReorderBufferChange
overhead and accumulated
invalidation entries)
patch 0 spill at 64MB ldwm
<500 MB at 64kB ldwm
(residual is invalidation entries, bounded by
MAX_DISTR_INVAL_MSG_PER_TXN)
Scripts to reproduce are in the second patch (v1-0002):
bench/setup_cluster.sh, bench/lazy_snapshot_bench.sh,
bench/run_matrix.sh, bench/aggregate.sh.
== Limitations ==
When the long-running transaction itself has continuous data changes
(e.g. COPY to many tables), K approaches G and lazy distribution
degenerates toward eager behavior:
K small (typical) K~G (bulk COPY)
Eager (before) O(G^2) O(G^2)
Lazy (this patch) O(K*N) O(G^2)
For the bulk COPY scenario, the root cause is that VACUUM's pg_class
statistics updates (relpages, reltuples, relallvisible) are treated as
catalog changes even though they do not affect tuple decoding. A
complementary optimization could filter these non-schema-affecting
catalog modifications from triggering snapshot rebuilds. That is a
separate change with its own considerations and is left as future work.
== Tests ==
1) Isolation test (contrib/test_decoding/specs/lazy_snapshot_distribution.spec):
Three scenarios verifying decoding correctness with lazy distribution:
- Long transaction + multiple ALTER TABLE ADD COLUMN + INSERT with
new schema
- Long transaction + many CREATE TABLE (catalog changes) + INSERT
- Subtransaction + DDL + INSERT with new schema
2) TAP test (contrib/test_decoding/t/002_lazy_snapshot_spill.pl):
Verifies that a long transaction with 200 catalog-modifying DDLs in
between its two INSERTs results in spill_bytes=0, confirming that lazy
distribution eliminates snapshot-caused spilling.
3) All existing tests pass without modification:
- test_decoding: 20 regression + 15 isolation + 1 TAP
- src/test/subscription: 39 TAP tests (581 test cases), covering
streaming, subtransactions, DDL, two-phase commit, etc.
- src/test/recovery: 51 TAP tests (635 test cases), including
006_logical_decoding and 038_save_logical_slots_shutdown
- contrib/pg_logicalinspect: 2 tests
Two patches attached:
v1-0001-Lazy-snapshot-distribution-in-logical-decoding.patch
The lazy distribution change itself plus isolation and TAP tests.
v1-0002-Benchmark-scripts-for-lazy-snapshot-distribution.patch
Reproducer scripts for the measurements above (bench/). Kept as a
separate patch since it lives outside the contrib/ tree and is not
intended for backport or for production builds.
Thanks,
Rui Zhao
v1-0001-Lazy-snapshot-distribution-in-logical-decoding.patch
Description: Binary data
v1-0002-Benchmark-scripts-for-lazy-snapshot-distribution.patch
Description: Binary data
