Hi,

On Tue, Dec 16, 2025 at 6:20 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi hackers,
> >
> > I'd like to propose an optimization for logical decoding's snapshot building
> > mechanism that eliminates repeated sorting overhead once a replication slot
> > reaches the CONSISTENT state.
> >
> > 1) Problem
> >
> > Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
> > every call using qsort(). Once a logical decoding slot reaches CONSISTENT
> > state, snapshots are built frequently—after every catalog-modifying
> > transaction commit. This repeated O(n log n) sorting becomes a performance
> > bottleneck as the committed transaction array grows.
> >
> > The existing code even has a TODO comment acknowledging this inefficiency:
> > "TODO: It's unclear whether that reasoning has much merit. Every time we add
> > something here after becoming consistent will also require distributing a
> > snapshot."
> >
> > 2) Proposed Solution
> >
> > Maintain sorted order via batch merge on each commit:
> >
> > 1. Collect all relevant XIDs (including subtransactions) into a batch array
> > 2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
> > 3. Merge into the global array: O(n + m) via reverse in-place merge
> >
> > While per-commit cost increases from O(m) to O(m log m + n), this is offset
> > by eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT
> > state builds snapshots after each catalog-modifying commit, the amortized
> > cost is significantly lower.
> >
> > 3) Performance Impact
> >
> > Expected improvements in CONSISTENT state:
> > - Eliminates O(n log n) qsort on every snapshot build
> > - Replaces with O(m log m + n) merge per commit
> > - For typical cases where m << n and snapshots are frequent, this is a win
> >
> > The benchmark results for this patch are shown below.
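
Before the configuration and benchmark details, here is a minimal,
self-contained sketch of the batch-sort-plus-reverse-merge idea described
in (2), for illustration only. merge_sorted_xids, the fixed-size arrays,
and the plain uint32 ordering are assumptions made for the example; this
is not code taken from the patch.

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef uint32_t TransactionId;   /* stand-in for PostgreSQL's TransactionId */

/* Ascending comparator for qsort(); the real patch would use whatever xid
 * ordering snapbuild.c already relies on. */
static int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    if (xa < xb)
        return -1;
    if (xa > xb)
        return 1;
    return 0;
}

/*
 * Merge a small sorted batch (m entries) into an already sorted array
 * (n entries) that has room for n + m entries, working from the back so
 * no temporary buffer is needed: O(n + m).
 */
static void
merge_sorted_xids(TransactionId *arr, size_t n,
                  const TransactionId *batch, size_t m)
{
    size_t  i = n;        /* elements of arr left to place */
    size_t  j = m;        /* elements of batch left to place */
    size_t  k = n + m;    /* next free slot (one past), filled right to left */

    while (j > 0)
    {
        if (i > 0 && arr[i - 1] > batch[j - 1])
            arr[--k] = arr[--i];
        else
            arr[--k] = batch[--j];
    }
    /* remaining arr[0..i-1] are already in their final positions */
}

int
main(void)
{
    /* sorted "committed.xip" with spare capacity at the end */
    TransactionId xip[8] = {100, 205, 310, 742};
    size_t        n = 4;

    /* batch: top-level xid plus subxids of one commit, sorted first */
    TransactionId batch[3] = {500, 120, 743};
    size_t        m = 3;

    qsort(batch, m, sizeof(TransactionId), xid_cmp);   /* O(m log m) */
    merge_sorted_xids(xip, n, batch, m);               /* O(n + m)   */

    for (size_t k = 0; k < n + m; k++)
        printf("%u\n", xip[k]);   /* 100 120 205 310 500 742 743 */
    return 0;
}

The backward traversal is what lets the merge reuse the existing array's
spare tail capacity without allocating a scratch buffer.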
> >
> > 4) Configuration
> >
> > shared_buffers = '4GB'
> > wal_level = logical
> > max_replication_slots = 10
> > max_wal_senders = 10
> > log_min_messages = warning
> > max_connections = 600
> > autovacuum = off
> > checkpoint_timeout = 15min
> > max_wal_size = 4GB
> >
> > 5) Workloads
> >
> > # Workload 1: DDL - High frequency, small commits
> > create_ddl_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/ddl.sql" <<'SQL'
> > -- High-frequency small catalog commits
> > -- Each commit triggers SnapBuildBuildSnapshot
> > DO $$
> > DECLARE
> >   tbl text := format('temp_%s_%s',
> >                      current_setting('application_name', true),
> >                      floor(random()*1e9)::int);
> > BEGIN
> >   EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT DROP', tbl);
> >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> >   EXECUTE format('SELECT * FROM %I', tbl);
> > END$$;
> > SQL
> > }
> >
> > # Workload 2: MIXED - mix of DDL and DML, 50%-50%
> > create_mixed_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/mixed_ddl.sql" <<'SQL'
> > -- DDL workload (catalog changes)
> > DO $$
> > DECLARE
> >   tbl text := format('t_%s_%s',
> >                      current_setting('application_name', true),
> >                      floor(random()*1e9)::int);
> > BEGIN
> >   EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
> >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> > END$$;
> > SQL
> >   cat >"$ROOT/mixed_dml.sql" <<'SQL'
> > -- DML workload (no catalog changes)
> > INSERT INTO app_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> > # Workload 3: CONTROL - Pure DML, no catalog changes
> > create_control_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/control.sql" <<'SQL'
> > -- Pure DML, no catalog changes
> > -- Should show no difference between baseline and patched
> > INSERT INTO control_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> > 6) Test strategy
> >
> > Clients: 100 concurrent connections per workload
> > Duration: 40 seconds per run
> > Repetitions: 1 run per workload type
> > Replication: logical replication slot using the `test_decoding` plugin
> >   with `EXPORT_SNAPSHOT=true`
> >
> > Workload types:
> > 1. DDL - pure catalog churn (temp table create/drop)
> > 2. Mixed - 50% DDL + 50% DML
> > 3. Control - pure DML (no catalog changes)
> >
> > Measurement:
> > - Metrics captured from pg_stat_replication_slots before/after each run
> > - Primary metrics: total_txns (transactions decoded) and total_bytes
> >   (data volume)
> > - Compared baseline (vanilla PostgreSQL) vs. patched (sorted
> >   committed.xip optimization)
> >
> > 7) Performance results
> >
> > DDL Workload: +235% Decoder Improvement
> >   Decoder throughput: 713.76 → 2396.52 txns/sec (+235%)
> >   Throughput (MB/s): 672.67 → 1747.22 MB/s (+159%)
> >   Decode efficiency: 9.46% → 33.29% (+23.83 points)
> >
> > Mixed Workload: No Change
> >   Decoder throughput: 2751.10 → 2730.00 txns/sec (0%)
> >   Decode efficiency: 40.47% → 40.47% (unchanged)
> >
> > The flamegraphs show that the qsort overhead in SnapBuild is eliminated
> > in the mixed workload as well, but the improvement seen in the DDL
> > workload does not appear here. My guess is that, for DML-heavy
> > workloads, ReorderBufferApplyChange becomes the new hotspot.
> >
> > DML Workload: No Change
> >   Decoder throughput: 3062.57 → 3066.37 txns/sec (0%)
> >   Decode efficiency: 49.97% → 49.97% (unchanged)
> >
> > === Workload: ddl ===
> > Client commits/sec:
> >   Baseline: 7545.76 commits/sec
> >   Patched:  7198.21 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline: 713.76 txns/sec (672.67 MB/s)
> >   Patched:  2396.52 txns/sec (1747.22 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline: 309376 committed → 29264 decoded (9.46%)
> >   Patched:  302325 committed → 100654 decoded (33.29%)
> >
> > Total decoded (all reps):
> >   Baseline: 29264 txns (27579.53 MB)
> >   Patched:  100654 txns (73383.10 MB)
> >
> > Decoder improvement: +235.00% (txns/sec)
> > Decoder improvement: +159.00% (MB/s)
> > Efficiency improvement: +23.83 percentage points (more transactions
> >   decoded per committed transaction)
> >
> > === Workload: mixed ===
> > Client commits/sec:
> >   Baseline: 6797.50 commits/sec
> >   Patched:  6745.26 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline: 2751.10 txns/sec (210.35 MB/s)
> >   Patched:  2730.00 txns/sec (205.64 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline: 285495 committed → 115546 decoded (40.47%)
> >   Patched:  283301 committed → 114660 decoded (40.47%)
> >
> > Total decoded (all reps):
> >   Baseline: 115546 txns (8834.71 MB)
> >   Patched:  114660 txns (8636.96 MB)
> >
> > ≈ Decoder unchanged: 0.00% (txns/sec)
> >
> > === Workload: DML ===
> > Client commits/sec:
> >   Baseline: 6129.24 commits/sec
> >   Patched:  6136.93 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline: 3062.57 txns/sec (0.26 MB/s)
> >   Patched:  3066.37 txns/sec (0.26 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline: 257428 committed → 128628 decoded (49.97%)
> >   Patched:  251614 committed → 125721 decoded (49.97%)
> >
> > Total decoded (all reps):
> >   Baseline: 128628 txns (10.98 MB)
> >   Patched:  125721 txns (10.69 MB)
> >
> > ≈ Decoder unchanged: 0.00% (txns/sec)
> >
> > 8) Potential regression
> >
> > The potential regression point is before the slot reaches the CONSISTENT
> > state, particularly when building_full_snapshot is set to true. In this
> > phase, all transactions, including those that don't modify the catalog,
> > must be added to the committed.xip array. These XIDs don't require later
> > snapshot builds or sorting, so the batch-insert logic increases the
> > per-insert cost from O(1) to O(m + n) without providing a direct
> > benefit.
> >
> > However, the impact of this regression could be limited. The system
> > remains in the pre-CONSISTENT phase only briefly, during initial
> > snapshot building, and the building_full_snapshot = true case is rare,
> > mainly used when creating replication slots with the EXPORT_SNAPSHOT
> > option.
> >
> > Once the slot becomes CONSISTENT, only catalog-modifying transactions
> > are tracked in committed.xip, and the patch reduces overall
> > snapshot-building overhead by eliminating repeated full-array sorts.
> >
> > We could also adopt a two-phase approach — keeping the current behavior
> > before reaching the CONSISTENT state and maintaining a sorted array only
> > after that point. This would preserve the performance benefits while
> > avoiding potential regressions. However, it would introduce additional
> > complexity and potential risks in handling the state transitions.
> >
> > if (builder->state < SNAPBUILD_CONSISTENT)
> > {
> >     /* ensure that only commits after this are getting replayed */
> >     if (builder->start_decoding_at <= lsn)
> >         builder->start_decoding_at = lsn + 1;
> >
> >     /*
> >      * If building an exportable snapshot, force xid to be tracked, even
> >      * if the transaction didn't modify the catalog.
> >      */
> >     if (builder->building_full_snapshot)
> >     {
> >         needs_timetravel = true;
> >     }
> > }
>
> After some consideration, a two-phase sorting strategy seems feasible to
> implement in a relatively straightforward manner, so that is what v2 does.
> I also plan to run benchmarks to evaluate potential regressions of v1's
> original "sort-at-all-stages" approach.
>
> > 9) Additional benefit
> >
> > With this patch applied, we can optimize SnapBuildPurgeOlderTxn to use
> > two binary searches, rather than iterating and copying, to find the
> > interval to keep in the sorted committed.xip array. [1]
> >
> > Feedback welcome.
> >
> > [1]
> > https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mail.gmail.com
>
> V2 fixes several issues in v1, including a potential memory leak and some
> type inconsistencies, and runs pgindent over the touched files.
>
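
As a rough illustration of the SnapBuildPurgeOlderTxn follow-up mentioned
in (9): once committed.xip is kept sorted, the entries to retain can be
bounded by two binary searches and compacted with a single memmove. The
sketch below is generic and deliberately simplified (hypothetical names,
plain uint32 ordering, no TransactionId wraparound handling); it is not
code from the patch or from the thread in [1].

#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef uint32_t TransactionId;   /* stand-in for PostgreSQL's TransactionId */

/* First index in sorted xip[0..n) whose value is >= key (lower bound). */
static size_t
xid_lower_bound(const TransactionId *xip, size_t n, TransactionId key)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (xip[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/*
 * Keep only the xids in [keep_min, keep_max] from a sorted array: two
 * binary searches bound the surviving interval, one memmove compacts it.
 */
static size_t
purge_outside_range(TransactionId *xip, size_t n,
                    TransactionId keep_min, TransactionId keep_max)
{
    size_t lo = xid_lower_bound(xip, n, keep_min);
    size_t hi = xid_lower_bound(xip, n, keep_max + 1);

    memmove(xip, xip + lo, (hi - lo) * sizeof(TransactionId));
    return hi - lo;              /* new element count */
}

int
main(void)
{
    TransactionId xip[] = {100, 120, 205, 310, 500, 742, 743};
    size_t n = purge_outside_range(xip, 7, 205, 742);

    for (size_t i = 0; i < n; i++)
        printf("%u\n", xip[i]);  /* 205 310 500 742 */
    return 0;
}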
V3 fixes a critical issue where the snapshot->xip array in
SnapBuildBuildSnapshot might not be sorted before reaching the consistent
state. Sorry for the noise here.

--
Best,
Xuneng
v3-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch
