Dear hackers,

I've created a new script which simulates a user reducing the workload on the publisher side. The attached zip file contains the script, an execution log, and pgbench outputs. Experiments were done with the v24 patch set.
Abstract
========
In this test, the conflict slot was invalidated as expected while the workload on the publisher was high, and it was not invalidated again after the workload had been reduced. This shows that even if the slot has been invalidated once, users can continue to detect the update_deleted conflict by reducing the workload on the publisher. Also, the TPS on the subscriber side becomes mostly the same as in the retain_conflict_info = off case after reducing the workload on the publisher.

Workload
========
v23_measure.sh is the script I used. It is a bit complex, but it mostly does the following:

1. Construct a pub-sub replication system.
2. Run pgbench (a TPC-B-like workload) on both nodes. Initially the parallelism of pgbench is 30 on both nodes. While the benchmark is running, TPS is reported once per second.
3. Check the status of the conflict slot periodically.
4. If the conflict slot is invalidated, stop pgbench on both nodes.
5. Disable the retain_conflict_info option and wait until the conflict slot is dropped.
6. Wait until all the changes on the publisher are replicated to the subscriber.
7. Enable retain_conflict_info again and wait until the conflict slot is created.
8. Re-run pgbench on both nodes. This time, the parallelism on the publisher side is cut in half.
9. Loop steps 3-8 until the total benchmark time reaches 900s.

Parameters
==========
Publisher GUCs:
shared_buffers = '30GB'
max_wal_size = 20GB
min_wal_size = 10GB
wal_level = logical

Subscriber GUCs:
autovacuum_naptime = '30s'
shared_buffers = '30GB'
max_wal_size = 20GB
min_wal_size = 10GB
track_commit_timestamp = on

max_conflict_retention_duration was varied between two values: 60s and 120s.

Results for max_conflict_retention_duration = 60s
=================================================
The parallelism on the publisher side was reduced 30 -> 15 -> 7 -> 3, and with parallelism 3 the conflict slot was finally no longer invalidated.
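For reference, the step-3/4 polling done by the script can be sketched roughly as below. This is a minimal sketch, not the script itself: it assumes the conflict slot is named pg_conflict_detection and that invalidation is surfaced via pg_replication_slots.invalidation_reason, both of which may differ in the patch set. $SUB_PORT is the subscriber's port.

```shell
# Hypothetical sketch of the periodic conflict-slot check (steps 3-4).
# Assumptions (may not match the patch): slot name 'pg_conflict_detection',
# invalidation reported in pg_replication_slots.invalidation_reason.

slot_invalidation_reason() {
    psql -At -p "$SUB_PORT" -d postgres -c \
        "SELECT COALESCE(invalidation_reason, '')
           FROM pg_replication_slots
          WHERE slot_name = 'pg_conflict_detection'"
}

monitor_slot() {
    # Poll once per second; return as soon as the slot is invalidated.
    while :; do
        reason=$(slot_invalidation_reason)
        if [ -n "$reason" ]; then
            echo "conflict slot invalidated: $reason"
            return 0
        fi
        sleep 1
    done
}
```

Once monitor_slot returns, the script proceeds with steps 5-8 (stop pgbench, drop and re-create the slot, and restart pgbench with halved publisher-side parallelism).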
The tables below show 1) the parallelism of the pgbench run, 2) the time period for that parallelism, and 3) the observed TPS of each iteration.

Publisher side:

nclients    Ran duration (s)    TPS
30          80                  34587.9
15          83                  19148.2
7           87                  9609.1
3           647                 4120.7

Subscriber side:

nclients    Ran duration (s)    TPS
30          80                  10688
30          83                  10834
30          87                  12327.5
30          647                 33300.1

For the 30/15/7 cases, the conflict slot was invalidated after around 80s, but it survived for parallelism = 3. At that point, the TPS on the subscriber side became mostly the same as the publisher's nclients = 30 case.

Results for max_conflict_retention_duration = 120s
==================================================
The trend was mostly the same as in the 60s case.

Publisher side:

nclients    Ran duration (s)    TPS
30          155                 28979.3
15          157                 19333.9
7           196                 9875.2
3           389                 4539

Subscriber side:

nclients    Ran duration (s)    TPS
30          155                 5925
30          157                 6912
30          196                 9157.1
30          389                 35736.6

Noticed
=======
While creating the script, I found that step 6 (wait until all the changes on the publisher are replicated to the subscriber) is necessary. If it is skipped, the slot is soon invalidated again, because the remaining changes have not yet been replicated to the subscriber and they delay the catch-up.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
<<attachment: resulsts.zip>>