On Wed, Mar 15, 2017 at 4:48 PM, Nikhil Sontakke
<nikh...@2ndquadrant.com> wrote:
>>     bool        valid;      /* TRUE if PGPROC entry is in proc array */
>>     bool        ondisk;     /* TRUE if prepare state file is on disk */
>> +   bool        inredo;     /* TRUE if entry was added via xlog_redo */
>> We could have a set of flags here, that's the 3rd boolean of the
>> structure used for a status.
>
> This is more of a cleanup and does not need to be part of this patch. This
> can be a follow-on cleanup patch.
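
For reference, the "set of flags" idea could look roughly like the sketch below. This is only an illustration: the GXACT_* macros, the GXactFlagsDemo struct and the helper functions are names made up here, not anything in the patch or in PostgreSQL, and the real structure carries many more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flag bits replacing the three booleans (names are made up). */
#define GXACT_VALID   ((uint8_t) (1u << 0))   /* PGPROC entry is in proc array */
#define GXACT_ONDISK  ((uint8_t) (1u << 1))   /* prepare state file is on disk */
#define GXACT_INREDO  ((uint8_t) (1u << 2))   /* entry was added via xlog_redo */

/* Stand-in for the status part of the real structure. */
typedef struct GXactFlagsDemo
{
    uint8_t flags;              /* OR of the GXACT_* bits above */
} GXactFlagsDemo;

/* Set one or more status bits. */
static inline void
gxact_set(GXactFlagsDemo *g, uint8_t f)
{
    assert(g != NULL);
    g->flags = (uint8_t) (g->flags | f);
}

/* Clear one or more status bits. */
static inline void
gxact_clear(GXactFlagsDemo *g, uint8_t f)
{
    assert(g != NULL);
    g->flags = (uint8_t) (g->flags & ~f);
}

/* True if any of the given bits is set. */
static inline bool
gxact_test(const GXactFlagsDemo *g, uint8_t f)
{
    assert(g != NULL);
    return (g->flags & f) != 0;
}
```

With something like this, a redo-added entry that has been flushed by a checkpoint would simply have both GXACT_INREDO and GXACT_ONDISK set, and a fourth status would not need another field.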

OK, that's fine for me. This patch is complicated enough anyway.

After some time thinking about it, I have finally put my finger on
what was itching me about this patch, and the answer is here:

+ *  Replay of twophase records happens by the following rules:
+ *
+ *      * On PREPARE redo we add the transaction to TwoPhaseState->prepXacts.
+ *        We set gxact->inredo to true for such entries.
+ *
+ *      * On Checkpoint we iterate through TwoPhaseState->prepXacts entries
+ *        that have gxact->inredo set and are behind the redo_horizon. We
+ *        save them to disk and also set gxact->ondisk to true.
+ *
+ *      * On COMMIT/ABORT we delete the entry from TwoPhaseState->prepXacts.
+ *        If gxact->ondisk is true, we delete the corresponding entry from
+ *        the disk as well.
+ *
+ *      * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
+ *        and PrescanPreparedTransactions() have been modified to go through
+ *        gxact->inredo entries that have not made to disk yet.

It seems to me that there should be an initial scan of pg_twophase at
the beginning of recovery, discarding on the way with a WARNING
entries that are older than the checkpoint redo horizon. This should
fill in shmem entries using something close to PrepareRedoAdd(), and
mark those entries as inredo. Then, at the end of recovery,
PrescanPreparedTransactions does not need to look at the entries in
pg_twophase. And that's the case as well of
RecoverPreparedTransaction(). I think that you could get the patch
much simplified this way, as any 2PC data can be fetched directly from
WAL segments and there is no need to rely on scans of pg_twophase,
this is replaced by scans of entries in TwoPhaseState.

> I also managed to do some perf testing.
>
> Modified Stas' earlier scripts slightly:
>
> \set naccounts 100000 * :scale
> \set from_aid random(1, :naccounts)
> \set to_aid random(1, :naccounts)
> \set delta random(1, 100)
> \set scale :scale+1
> BEGIN;
> UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
> UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
> PREPARE TRANSACTION ':client_id.:scale';
> COMMIT PREPARED ':client_id.:scale';
>
> Created a base backup with scale factor 125 on an AWS t2.large instance. Set
> up archiving and did a 20 minute run with the above script saving the WALs
> in the archive.
>
> Then used recovery.conf to point to this WAL location and used the base
> backup to recover.
>
> With this patch applied: 20s
> Without patch: Stopped measuring after 5 minutes ;-)

And that's really nice.
--
Michael
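
P.S. To make the replay rules quoted from the patch comment concrete, here is a toy, standalone C model of them. It is a sketch only, not the patch's code: every Toy*/toy_* name is invented for illustration, the fixed-size array stands in for TwoPhaseState->prepXacts, and "behind the redo_horizon" is assumed here to mean prepare_end_lsn <= redo_horizon. Nothing is actually written to or removed from disk; the ondisk flag just records what would have happened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define TOY_MAX_PREPARED 8      /* toy capacity, not a PostgreSQL setting */

typedef struct ToyGXact
{
    uint32_t xid;               /* transaction id of the prepared xact */
    uint64_t prepare_end_lsn;   /* end of its PREPARE record in WAL */
    bool     ondisk;            /* state file would exist on disk */
    bool     inredo;            /* entry was added during redo */
} ToyGXact;

typedef struct ToyTwoPhaseState
{
    int      num;
    ToyGXact prepXacts[TOY_MAX_PREPARED];
} ToyTwoPhaseState;

/* Rule 1: on PREPARE redo, add an in-memory entry marked inredo. */
static bool
toy_prepare_redo(ToyTwoPhaseState *s, uint32_t xid, uint64_t end_lsn)
{
    assert(s != NULL);
    if (s->num >= TOY_MAX_PREPARED)
        return false;
    s->prepXacts[s->num++] = (ToyGXact) {
        .xid = xid, .prepare_end_lsn = end_lsn,
        .ondisk = false, .inredo = true
    };
    return true;
}

/*
 * Rule 2: on checkpoint, "save" inredo entries behind the redo horizon.
 * Returns how many entries were newly marked ondisk.
 */
static int
toy_checkpoint(ToyTwoPhaseState *s, uint64_t redo_horizon)
{
    int flushed = 0;

    assert(s != NULL);
    for (int i = 0; i < s->num; i++)
    {
        ToyGXact *g = &s->prepXacts[i];

        if (g->inredo && !g->ondisk && g->prepare_end_lsn <= redo_horizon)
        {
            g->ondisk = true;
            flushed++;
        }
    }
    return flushed;
}

/*
 * Rule 3: on COMMIT/ABORT PREPARED redo, drop the in-memory entry.
 * Returns 1 if a state file would also have to be removed, 0 if the
 * entry only ever lived in memory, -1 if the xid is unknown.
 */
static int
toy_finish_redo(ToyTwoPhaseState *s, uint32_t xid)
{
    assert(s != NULL);
    for (int i = 0; i < s->num; i++)
    {
        if (s->prepXacts[i].xid == xid)
        {
            int had_file = s->prepXacts[i].ondisk ? 1 : 0;

            /* Fill the hole with the last entry; order does not matter. */
            s->prepXacts[i] = s->prepXacts[--s->num];
            return had_file;
        }
    }
    return -1;
}
```

In this model, a transaction that is prepared and finished between two checkpoints never gets ondisk set at all, which is the short-lived 2PC case the pgbench script above exercises.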