On 01/26/2016 07:43 AM, Stas Kelvich wrote:
Thanks for the reviews and commit!
As Simon and Andres already mentioned in this thread, replay of two-phase
transactions is significantly slower than the same operations in normal mode.
The major reason is that each state file is fsynced during replay, and while
that is not a problem for recovery, it is a problem for replication. Under high
2PC update load, the lag between master and async replica constantly increases (see
One way to improve things is to move the fsyncs to restartpoints, but as we saw
previously that is a half-measure, and just the frequent calls to fopen can cause
Another option is to use the same scenario for replay that is already used
in non-recovery mode: read state files into memory during replay of prepare, and
if a checkpoint/restartpoint occurs between prepare and commit, move the data to
files. On commit we can read from xlog or from files. So here is the patch that
implements this scenario for replay.
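To make the scenario above concrete, here is a toy model of it in Python. This is only a sketch under stated assumptions, not code from the patch: the class ToyStandby and all of its method and attribute names are hypothetical, two dictionaries stand in for shared memory and for the pg_twophase files, and a counter stands in for the fsync calls.

```python
class ToyStandby:
    """Toy model of prepared-transaction state during replay (not PostgreSQL code)."""

    def __init__(self):
        self.in_memory = {}  # xid -> state data held only in memory
        self.on_disk = {}    # xid -> state data, standing in for pg_twophase files
        self.fsyncs = 0      # number of simulated state-file fsyncs

    def replay_prepare(self, xid, state):
        # Replay of a prepare record: remember the state, write no file.
        self.in_memory[xid] = state

    def restartpoint(self):
        # Only transactions that are still prepared here get a state file.
        for xid, state in self.in_memory.items():
            self.on_disk[xid] = state
            self.fsyncs += 1
        self.in_memory.clear()

    def replay_commit_prepared(self, xid):
        # Commit finds the state either in memory or in a file.
        if xid in self.in_memory:
            return self.in_memory.pop(xid)
        return self.on_disk.pop(xid)


standby = ToyStandby()
standby.replay_prepare(100, "state-100")
standby.replay_commit_prepared(100)  # committed before any restartpoint
standby.replay_prepare(101, "state-101")
standby.restartpoint()               # 101 is still prepared, so it goes to a file
standby.replay_commit_prepared(101)
print(standby.fsyncs)                # prints 1
```

In this model a transaction that is prepared and committed between two restartpoints never costs an fsync, which is the case the high 2PC update load above consists of almost entirely.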
The patch is quite straightforward. During replay of prepare records,
RecoverPreparedFromXLOG() is called to create the in-memory state in GXACT, PROC,
PGPROC; on commit, XlogRedoFinishPrepared() is called to clean up that state.
Also, there are several functions (PrescanPreparedTransactions,
StandbyTransactionIdIsPrepared) that assumed that during replay all
prepared xacts have files in pg_twophase, so I have extended them to check
A side effect of that behaviour is that we can see prepared xacts in the
pg_prepared_xacts view on the slave.
Since this patch touches a quite sensitive part of postgres replay and there are
some rarely used code paths, I wrote a shell script to set up master/slave
replication and test different failure scenarios that can happen with the
instances. I am attaching this file to show the scenarios that I have tested and,
more importantly, to show what I didn’t test. In particular, I failed to
reproduce a situation where StandbyTransactionIdIsPrepared() is called; maybe
somebody can suggest a way to force its usage. Also, I’m not too sure about the
necessity of calling cache invalidation callbacks during
XlogRedoFinishPrepared(); I’ve marked this place in the patch with 2REVIEWER
Tests show that this patch increases the speed of 2PC replay to the level where
the replica can keep pace with the master.
Graph: replica lag under a pgbench run for 200 seconds with 2PC update transactions (80
connections, one update per 2PC tx, two servers with 12 cores each, 10GbE interconnect)
on current master and with the suggested patch. Replica lag was measured with "select
sent_location-replay_location as delay from pg_stat_replication;" each second.
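For anyone who wants to redo the lag arithmetic outside the server, the subtraction in that query can be sketched in Python as below. This assumes the usual textual form of a WAL location, two hexadecimal numbers separated by a slash with the high 32 bits first; the function names and the sample locations are made up for illustration and are not values from the benchmark.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual WAL location such as '0/3000060' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(sent_location: str, replay_location: str) -> int:
    """Bytes of WAL sent to the replica but not yet replayed there."""
    return lsn_to_bytes(sent_location) - lsn_to_bytes(replay_location)


# Hypothetical sample values, not measurements.
print(replication_lag_bytes("0/3000060", "0/3000000"))  # prints 96
```

A lag that keeps growing from one sample to the next is what the graph shows for current master; a flat line means the replica is keeping pace.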
* The patch needs a rebase against the latest TwoPhaseFileHeader change
* Rework the check.sh script into a TAP test case (src/test/recovery),
as suggested by Alvaro and Michael down thread
* Add documentation for RecoverPreparedFromXLOG
+ * that xlog record. We need just to clen up memmory state.
'clean' + 'memory'
+ * This is usually called after end-of-recovery checkpoint, so all 2pc
+ * files moved xlog to files. But if we restart slave when master is
+ * switched off this function will be called before checkpoint ans we
+ * to check PGXACT array as it can contain prepared transactions that
+ * didn't created any state files yet.
"We need to check the PGXACT array for prepared transactions that
doesn't have any state file in case of a slave restart with the master
+ * prepare xlog resords in shared memory in the same way as it
+ * We need such behaviour because speed of 2PC replay on
+ * be at least not slower than 2PC tx speed on master.
"We need this behaviour because the speed of the 2PC replay on the
replica should be at least the same as the 2PC transaction speed of the
I'll leave the 2REVIEWER section to Simon.