On 18/08/2025 9:56 AM, Nisha Moond wrote:
On Wed, Aug 13, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
Here is the initial POC patch for this idea.

Thank you Hou-san for the patch.

I did some performance benchmarking of the patch, and overall the
results show substantial performance improvements.
Please find the details as follows:

Source code:
----------------
pgHead (572c0f1b0e) and v1-0001 patch

Setup:
---------
Pub --> Sub
  - Two nodes created in a pub-sub logical replication setup.
  - Both nodes have the same set of pgbench tables created with scale=300.
  - The Sub node subscribes to all changes from the Pub node's pgbench
tables (a rough sketch of the equivalent commands is below).
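
For reference, the setup could be scripted roughly as follows; the database
name, ports, and the copy_data choice are assumptions, not taken from the
attached scripts:

```
# Minimal sketch: database "postgres", Pub on port 5432, Sub on port 5433.

# Identical pgbench tables (scale factor 300) on both nodes.
pgbench -i -s 300 -p 5432 postgres
pgbench -i -s 300 -p 5433 postgres

# Publish all changes on the Pub node ...
psql -p 5432 -d postgres -c "CREATE PUBLICATION pub FOR ALL TABLES;"

# ... and subscribe to them from the Sub node. copy_data = false because the
# tables were initialized identically on both sides (an assumption about how
# the attached scripts do it).
psql -p 5433 -d postgres -c "CREATE SUBSCRIPTION sub
    CONNECTION 'host=localhost port=5432 dbname=postgres'
    PUBLICATION pub WITH (copy_data = false);"
```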

Workload Run:
--------------------
  - Disable the subscription on the Sub node.
  - Run the default pgbench (read-write) workload only on the Pub node with
#clients=40 and duration=10 minutes.
  - Enable the subscription on the Sub node once pgbench completes, and then
measure the time taken for replication to catch up (sketched below).
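
A rough sketch of these steps, under the same assumptions as above plus a
subscription (and hence replication slot) named "sub" and a guessed -j
thread count; the attached scripts are the authoritative version:

```
# 1. Pause apply on the subscriber.
psql -p 5433 -d postgres -c "ALTER SUBSCRIPTION sub DISABLE;"

# 2. Generate the workload on the publisher: default (TPC-B-like) pgbench,
#    40 clients, 10 minutes.
pgbench -p 5432 -c 40 -j 40 -T 600 postgres

# 3. Remember how far the publisher has written WAL.
TARGET_LSN=$(psql -p 5432 -d postgres -Atc "SELECT pg_current_wal_lsn();")

# 4. Resume apply and time how long the subscriber takes to catch up, i.e.
#    until the subscription's slot has confirmed the target LSN.
START=$SECONDS
psql -p 5433 -d postgres -c "ALTER SUBSCRIPTION sub ENABLE;"
until [ "$(psql -p 5432 -d postgres -Atc \
    "SELECT confirmed_flush_lsn >= '$TARGET_LSN'::pg_lsn
       FROM pg_replication_slots WHERE slot_name = 'sub';")" = "t" ]; do
    sleep 5
done
echo "Time taken in replication: $((SECONDS - START)) sec"
```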
~~~

Test-01: Measure Replication lag
----------------------------------------
Observations:
---------------
  - Replication time improved as the number of parallel workers
increased with the patch.
  - On pgHead, replicating a 10-minute publisher workload took ~46 minutes.
  - With just 2 parallel workers (the default), replication time was cut
roughly in half, and with 8 workers it completed in ~13 minutes (3.5x faster).
  - With 16 parallel workers, the speedup reached ~3.7x over pgHead.
  - With 32 workers, the gains plateaued, likely because the workload does
not leave enough independent work per worker on this machine to yield
further improvement.

Detailed Result:
-----------------
Case                   Time_taken_in_replication(sec)   rep_time_in_minutes   faster_than_head
1. pgHead                            2760.791            46.01318333                -
2. patched_#worker=2                 1463.853            24.3975                1.88 times
3. patched_#worker=4                 1031.376            17.1896                2.68 times
4. patched_#worker=8                  781.007            13.0168                3.54 times
5. patched_#worker=16                 741.108            12.3518                3.73 times
6. patched_#worker=32                 787.203            13.1201                3.51 times
~~~~

Test-02: Measure number of transactions parallelized
-----------------------------------------------------
  - Used a top-up patch to LOG the number of transactions applied by
parallel workers, the number applied by the leader, and the number that
are dependent on other transactions.
  - Example LOG output:
   ```
LOG:  parallelized_nxact: 11497254 dependent_nxact: 0 leader_applied_nxact: 600
```
  - parallelized_nxact: the number of transactions applied by parallel workers
  - dependent_nxact: the number of dependent transactions
  - leader_applied_nxact: the number of transactions applied by the leader worker
  (the required top-up v1-002 patch is attached.)
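
For reference, percentages like those in the results below can be derived
from the last such LOG line, assuming the counters are cumulative; the log
file name and the total transaction count are placeholders:

```
# Rough sketch: take the last counter line from the subscriber log and
# report each counter as a share of the total pgbench transactions.
TOTAL_TXNS=24745648    # from the pgbench run, placeholder value
grep 'parallelized_nxact' subscriber.log | tail -n 1 |
awk -v total="$TOTAL_TXNS" '{
    for (i = 1; i < NF; i++)
        if ($i ~ /_nxact:$/)
            printf "%s %d (%.4f%%)\n", $i, $(i + 1), 100 * $(i + 1) / total
}'
```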

Observations:
----------------
  - With 4 to 8 parallel workers, ~80%-98% of transactions are parallelized.
  - As the number of workers increases, the parallelized percentage rises,
reaching 99.99% with 16 or more workers.

Detailed Result:
-----------------
case1: #parallel_workers = 2(default)
   #total_pgbench_txns = 24745648
     parallelized_nxact = 14439480 (58.35%)
     dependent_nxact    = 16 (0.00006%)
     leader_applied_nxact = 10306153 (41.64%)

case2: #parallel_workers = 4
   #total_pgbench_txns = 24776108
     parallelized_nxact = 19666593 (79.37%)
     dependent_nxact    = 212 (0.0008%)
     leader_applied_nxact = 5109304 (20.62%)

case3: #parallel_workers = 8
   #total_pgbench_txns = 24821333
     parallelized_nxact = 24397431 (98.29%)
     dependent_nxact    = 282 (0.001%)
     leader_applied_nxact = 423621 (1.71%)

case4: #parallel_workers = 16
   #total_pgbench_txns = 24938255
     parallelized_nxact = 24937754 (99.99%)
     dependent_nxact    = 142 (0.0005%)
     leader_applied_nxact = 360 (0.0014%)

case5: #parallel_workers = 32
   #total_pgbench_txns = 24769474
     parallelized_nxact = 24769135 (99.99%)
     dependent_nxact    = 312 (0.0013%)
     leader_applied_nxact = 28 (0.0001%)

~~~~~
The scripts used for the above tests are attached.

Next, I plan to extend the testing to larger workloads by running
pgbench for 20–30 minutes.
We will also benchmark performance across different workload types to
evaluate the improvements once the patch has matured further.

--
Thanks,
Nisha


I also did some benchmarking of the proposed parallel apply patch and compared it with my prewarming approach. Parallel apply is significantly more efficient than prefetch, as expected.

I ran two tests (more details here):

https://www.postgresql.org/message-id/flat/84ed36b8-7d06-4945-9a6b-3826b3f999a6%40garret.ru#70b45c44814c248d3d519a762f528753

One performs random updates and the other inserts rows with random keys.
I stop the subscriber, apply the workload at the publisher for 100 seconds, and then measure how long it takes the subscriber to catch up.
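
For illustration only, the two workloads might look roughly like the
following pgbench custom scripts; the table, key ranges, and client count
are guesses rather than the actual scripts from the linked thread:

```
# Hypothetical shape of the two workloads, each run for 100 seconds.
cat > update.sql <<'EOF'
\set id random(1, 10000000)
UPDATE test SET value = value + 1 WHERE id = :id;
EOF

cat > insert.sql <<'EOF'
\set id random(1, 1000000000)
INSERT INTO test (id, value) VALUES (:id, 0) ON CONFLICT DO NOTHING;
EOF

pgbench -n -f update.sql -c 8 -T 100 postgres   # update test
pgbench -n -f insert.sql -c 8 -T 100 postgres   # insert test
```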

update test (with 8 parallel apply workers):

    master:          8:30 min
    prefetch:        2:05 min
    parallel apply:  1:30 min

insert test (with 8 parallel apply workers):

    master:          9:20 min
    prefetch:        3:08 min
    parallel apply:  1:54 min


