Dear All,
I'd like to present and talk about a problem when 2PC transactions are applied 
quite slowly on a replica during logical replication. There is a master and a 
replica with established logical replication from the master to the replica 
with twophase = true. With some load level on the master, the replica starts to 
lag behind the master, and the lag will be increasing. We have to significantly 
decrease the load on the master to allow replica to complete the catchup. Such 
problem may create significant difficulties in the production. The problem 
appears at least on REL_16_STABLE branch.
To reproduce the problem:
 * Setup logical replication from master to replica with subscription parameter 
twophase =  true. * Create some intermediate load on the master (use pgbench 
with custom sql with prepare+commit) * Optionally switch off the replica for 
some time (keep load on master). * Switch on the replica and wait until it 
reaches the master.
The replica will never reach the master with even some low load on the master. 
If to remove the load, the replica will reach the master for much greater time, 
than expected. I tried the same for regular transactions, but such problem 
doesn't appear even with a decent load.
I think, the main proplem of 2PC catchup bad performance - the lack of 
asynchronous commit support for 2PC. For regular transactions asynchronous 
commit is used on the replica by default (subscrition sycnronous_commit = off). 
It allows the replication worker process on the replica to avoid fsync 
(XLogFLush) and to utilize 100% CPU (the background wal writer or checkpointer 
will do fsync). I agree, 2PC are mostly used in multimaster configurations with 
two or more nodes which are performed synchronously, but when the node in 
catchup (node is not online in a multimaster cluster), asynchronous commit have 
to be used to speedup the catchup.
There is another thing that affects on the disbalance of the master and replica 
performance. When the master executes requestes from multiple clients, there is 
a fsync optimization takes place in XLogFlush. It allows to decrease the number 
of fsync in case when a number of parallel backends write to the WAL 
simultaneously. The replica applies received transactions in one thread 
sequentially, such optimization is not applied.
I see some possible solutions:
 * Implement asyncronous commit for 2PC transactions. * Do some hacking with 
enableFsync when it is possible.
I think, asynchronous commit support for 2PC transactions should significantly 
increase replica performance and help to solve this problem. I tried to 
implement it (like for usual transactions) but I've found another problem: 2PC 
state is stored in WAL on prepare, on commit we have to read 2PC state from WAL 
but the read is delayed until WAL is flushed by the background wal writer (read 
LSN should be less than flush LSN). Storing 2PC state in a shared memory (as it 
proposed earlier) may help.

I used the following query to monitor the catchup progress on the master:SELECT 
sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
I used the following script for pgbench to the master:SELECT 
md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;
 
What do you think?
 
With best regards,
Vitaly Davydov

Reply via email to