On Wed, Jul 9, 2025 at 5:39 PM Amit Kapila <[email protected]> wrote:
>
> On Tue, Jul 8, 2025 at 12:18 AM Masahiko Sawada <[email protected]> wrote:
> >
> > On Mon, Jul 7, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
> > <[email protected]> wrote:
> >
> > I think these performance regressions occur because at some point the
> > subscriber can no longer keep up with the changes occurring on the
> > publisher. This is because the publisher runs multiple transactions
> > simultaneously, while the Subscriber applies them with one apply
> > worker. When retain_conflict_info = on, the performance of the apply
> > worker deteriorates because it retains dead tuples, and as a result it
> > gradually cannot keep up with the publisher, the table bloats, and the
> > TPS of pgbench executed on the subscriber is also affected. This
> > happened when only 40 clients (or 15 clients according to the results
> > of test 4?) were running simultaneously.
>
> I think here the primary reason is the speed of one apply worker vs.
> 15 or 40 clients working on the publisher, and all the data is being
> replicated. We don't see regression at 3 clients, which suggests apply
> worker is able to keep up with that much workload. Now, we have
> checked that if the workload is slightly different such that fewer
> clients (say 1-3) work on same set of tables and then we make
> different set of pub-sub pairs for all such different set of clients
> (for example, 3 clients working on tables t1 and t2, other 3 clients
> working on tables t3 and t4; then we can have 2 pub-sub pairs, one for
> tables t1, t2, and other for t3-t4) then there is almost negligible
> regression after enabling retain_conflict_info. Additionally, for very
> large transactions that can be parallelized, we shouldn't see any
> regression because those can be applied in parallel.
Yes, in test case-03 [1], the performance drop (~50%) observed on the
subscriber side was primarily due to a single apply worker handling
changes from 40 concurrent clients on the publisher, which led to the
accumulation of dead tuples.
To validate this and simulate a more realistic workload, we designed a
test as suggested above, in which multiple clients update different
tables and multiple subscriptions exist on the subscriber (one per
table set).
A custom pgbench script was created to run pgbench on the publisher,
with each client updating a unique set of tables. On the subscriber
side, we created one subscription per set of tables, so each
publication-subscription pair handles a distinct table set.
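
For reference, the per-client transaction of such a script could look
roughly like the sketch below. This is only a minimal TPC-B-style
illustration, not the attached script itself; it assumes scale factor 1
(hence the hard-coded ranges) and relies on pgbench textually
interpolating the built-in :client_id variable into the table names:

\set aid random(1, 100000)
\set tid random(1, 10)
\set bid random(1, 1)
\set delta random(-5000, 5000)
BEGIN;
-- :client_id expands to 0..14, selecting this client's table set
UPDATE pgbench_accounts_:client_id SET abalance = abalance + :delta WHERE aid = :aid;
UPDATE pgbench_tellers_:client_id SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches_:client_id SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history_:client_id (tid, bid, aid, delta, mtime)
    VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

Such a script would be run with something like "pgbench -n -c 15 -j 15
-T 300 -f <script>"; -n is needed because the default pgbench tables
are absent, so the pre-run vacuum would fail.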
Highlights
==========
- Two tests were run with two different workloads: 15 and 45
concurrent clients, respectively.
- No regression was observed when publisher changes were processed by
multiple apply workers on the subscriber.
Source used
===========
pgHead commit 62a17a92833 + v47 patch set
Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 CPU cores, 503 GiB RAM
01. pgbench on both sides (with 15 clients)
=====================================
Setup:
- Publisher and subscriber nodes were created with the following configuration:
autovacuum = false
shared_buffers = '30GB'
-- Also, worker- and logical-replication-related parameters were
increased as needed (see the attached scripts for details; an
illustrative sketch follows this Setup section).
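
For illustration, the non-default settings could be applied roughly as
below, followed by a restart of each node. The worker/slot values here
are assumptions for this sketch; the exact values are in the attached
scripts.

-- On both nodes:
ALTER SYSTEM SET autovacuum = off;
ALTER SYSTEM SET shared_buffers = '30GB';

-- Publisher: enough WAL senders/slots for 15 subscriptions
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_wal_senders = 20;
ALTER SYSTEM SET max_replication_slots = 20;

-- Subscriber: enough apply workers and origins for 15 subscriptions
ALTER SYSTEM SET max_logical_replication_workers = 20;
ALTER SYSTEM SET max_worker_processes = 30;
ALTER SYSTEM SET max_replication_slots = 20;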
Workload:
- The publisher has 15 sets of pgbench tables. Each set includes four
tables (pgbench_accounts, pgbench_tellers, pgbench_branches, and
pgbench_history), suffixed per set:
pgbench_accounts_0, pgbench_tellers_0, ..., pgbench_accounts_14,
pgbench_tellers_14, etc.
- Ran pgbench with 15 clients on *both sides*.
-- On the publisher, each client updates *only one* set of pgbench
tables: e.g., client '0' updates the pgbench_xx_0 tables, client '1'
updates the pgbench_xx_1 tables, and so on.
-- On the subscriber, there is one subscription per table set of the
publisher, i.e., one apply worker consumes the changes corresponding
to each client. So, #subscriptions on subscriber (15) = #clients on
publisher (15). A SQL sketch of one such pair is given below, after
this list.
- On the subscriber, the default pgbench workload is also run with 15 clients.
- The duration was 5 minutes, and the measurement was repeated 3 times.
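
To make the pub/sub layout concrete, one pair could be created roughly
as follows. The object names and the connection string are placeholders
for this sketch, and retain_conflict_info is the subscription option
added by the v47 patch set (assumed here to be specified per
subscription, and enabled only for the patched runs):

-- Publisher: one publication per table set (set 0 shown)
CREATE PUBLICATION pub_0 FOR TABLE
    pgbench_accounts_0, pgbench_tellers_0,
    pgbench_branches_0, pgbench_history_0;

-- Subscriber: one subscription per publication, giving one apply
-- worker per publisher client
CREATE SUBSCRIPTION sub_0
    CONNECTION 'host=publisher_host port=5432 dbname=postgres'
    PUBLICATION pub_0
    WITH (retain_conflict_info = on);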
Test Scenarios & Results:
Publisher:
- pgHead : Median TPS = 10386.93507
- pgHead + patch : Median TPS = 10187.0887 (TPS reduced ~2%)
Subscriber:
- pgHead : Median TPS = 10006.3903
- pgHead + patch : Median TPS = 9986.269682 (TPS reduced ~0.2%)
Observation:
- No performance regression was observed on either the publisher or
the subscriber with the patch applied.
- The TPS drop was under 2% on both sides, within the expected
run-to-run variation range.
Detailed Results Table:
On publisher:
#run     pgHEAD         pgHead+patch(ON)
1        10477.26438    10029.36155
2        10261.63429    10187.0887
3        10386.93507    10750.86231
median   10386.93507    10187.0887

On subscriber:
#run     pgHEAD         pgHead+patch(ON)
1        10261.63429    9813.114002
2        9962.914457    9986.269682
3        10006.3903     10580.13015
median   10006.3903     9986.269682
~~~~
02. pgbench on both sides (with 45 clients)
=====================================
Setup:
- same as case 01.
Workload:
- The publisher has the same 15 sets of pgbench tables as in case 01,
with three clients updating each set of tables.
- Ran pgbench with 45 clients on *both sides*.
-- On the publisher, each set of pgbench tables is updated by *three*
clients: e.g., clients '0', '15', and '30' update the pgbench_xx_0
tables; clients '1', '16', and '31' update the pgbench_xx_1 tables;
and so on.
-- On the subscriber, there is one subscription per table set of the
publisher, i.e., one apply worker consumes the changes corresponding
to *three* publisher clients.
- On the subscriber, the default pgbench workload is also run with 45 clients.
- The duration was 5 minutes, and the measurement was repeated 3 times.
Test Scenarios & Results:
Publisher:
- pgHead : Median TPS = 13845.7381
- pgHead + patch : Median TPS = 13553.682 (TPS reduced ~2%)
Subscriber:
- pgHead : Median TPS = 10080.54686
- pgHead + patch : Median TPS = 9908.304381 (TPS reduced ~1.7%)
Observation:
- No significant performance regression observed on either the
publisher or subscriber with the patch applied.
- The TPS drop was under 2% on both sides, within expected case to
case variation range.
Detailed Results Table:
On publisher:
#run     pgHEAD         pgHead+patch(ON)
1        14446.62404    13616.81375
2        12988.70504    13425.22938
3        13845.7381     13553.682
median   13845.7381     13553.682

On subscriber:
#run     pgHEAD         pgHead+patch(ON)
1        10505.47481    9908.304381
2        9963.119531    9843.280308
3        10080.54686    9987.983147
median   10080.54686    9908.304381
~~~~
The scripts used to perform the above tests are attached.
[1] https://www.postgresql.org/message-id/OSCPR01MB1496663AED8EEC566074DFBC9F54CA%40OSCPR01MB14966.jpnprd01.prod.outlook.com
test2_files_perf_15-clients.tar
test2_files_perf_45-clients.tar
