On Wed, Mar 15, 2017 at 4:48 PM, Nikhil Sontakke
<nikh...@2ndquadrant.com> wrote:
>>     bool        valid;          /* TRUE if PGPROC entry is in proc array */
>>     bool        ondisk;         /* TRUE if prepare state file is on disk */
>> +   bool        inredo;         /* TRUE if entry was added via xlog_redo */
>> We could have a set of flags here, that's the 3rd boolean of the
>> structure used for a status.
> This is more of a cleanup and does not need to be part of this patch. This
> can be a follow-on cleanup patch.

OK, that's fine for me. This patch is complicated enough anyway.

After some time thinking about it, I have finally put my finger on
what was itching me about this patch, and the answer is here:

+ *      Replay of twophase records happens by the following rules:
+ *
+ *      * On PREPARE redo we add the transaction to TwoPhaseState->prepXacts.
+ *        We set gxact->inredo to true for such entries.
+ *
+ *      * On Checkpoint we iterate through TwoPhaseState->prepXacts entries
+ *        that have gxact->inredo set and are behind the redo_horizon. We
+ *        save them to disk and also set gxact->ondisk to true.
+ *
+ *      * On COMMIT/ABORT we delete the entry from TwoPhaseState->prepXacts.
+ *        If gxact->ondisk is true, we delete the corresponding entry from
+ *        the disk as well.
+ *
+ *      * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
+ *        and PrescanPreparedTransactions() have been modified to go through
+ *        gxact->inredo entries that have not made it to disk yet.
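
To make the first three rules quoted above concrete, here is a minimal
standalone sketch in C. It is not code from the patch: the ToyGxact struct,
the toy_* function names, the fixed-size array and the prepare_lsn field are
all assumptions made for illustration, and "behind the redo_horizon" is
modelled as a plain LSN comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_PREPARED 8

/* Toy stand-in for one TwoPhaseState->prepXacts entry (not the real struct). */
typedef struct ToyGxact
{
    bool        used;           /* slot is occupied */
    uint32_t    xid;            /* transaction id of the prepared xact */
    uint64_t    prepare_lsn;    /* assumed: WAL position of the PREPARE record */
    bool        inredo;         /* entry was added during redo */
    bool        ondisk;         /* state file has been written out */
} ToyGxact;

static ToyGxact toy_prepXacts[TOY_MAX_PREPARED];

/* Rule 1: PREPARE redo adds an in-memory entry flagged inredo. */
bool
toy_redo_prepare(uint32_t xid, uint64_t prepare_lsn)
{
    assert(xid != 0);
    for (int i = 0; i < TOY_MAX_PREPARED; i++)
    {
        if (!toy_prepXacts[i].used)
        {
            toy_prepXacts[i] = (ToyGxact) {true, xid, prepare_lsn, true, false};
            return true;
        }
    }
    return false;               /* no free slot */
}

/* Rule 2: a checkpoint persists inredo entries behind the redo horizon. */
int
toy_checkpoint(uint64_t redo_horizon)
{
    int         flushed = 0;

    for (int i = 0; i < TOY_MAX_PREPARED; i++)
    {
        ToyGxact   *g = &toy_prepXacts[i];

        if (g->used && g->inredo && !g->ondisk &&
            g->prepare_lsn <= redo_horizon)
        {
            g->ondisk = true;   /* the real code would write the state file */
            flushed++;
        }
    }
    return flushed;
}

/*
 * Rule 3: COMMIT/ABORT PREPARED redo drops the entry.  Returns 1 if a state
 * file would also have to be removed, 0 if not, -1 if the xid is unknown.
 */
int
toy_redo_finish(uint32_t xid)
{
    for (int i = 0; i < TOY_MAX_PREPARED; i++)
    {
        ToyGxact   *g = &toy_prepXacts[i];

        if (g->used && g->xid == xid)
        {
            int         had_file = g->ondisk ? 1 : 0;

            g->used = false;
            return had_file;
        }
    }
    return -1;
}
```

In this toy model, a transaction that is prepared and finished between two
checkpoints never touches the disk, which is the case the patch is trying to
make cheap.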

It seems to me that there should be an initial scan of pg_twophase at
the beginning of recovery, which discards with a WARNING any entries
that are older than the checkpoint redo horizon. This scan should
fill in shmem entries using something close to PrepareRedoAdd(), and
mark those entries as inredo. Then, at the end of recovery,
PrescanPreparedTransactions() does not need to look at the entries in
pg_twophase, and the same holds for RecoverPreparedTransactions(). I
think that the patch could be much simplified this way: any 2PC data
can be fetched directly from WAL segments, so there is no need to rely
on scans of pg_twophase; those are replaced by scans of the entries in
TwoPhaseState.
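
A rough standalone sketch in C of the flow suggested above, just to show the
shape of it. Nothing here is real backend code: the Sketch* types and
sketch_* functions are made-up names, the state files are passed in as an
array instead of being read from pg_twophase, and the "older than the
horizon" test is assumed to be a plain XID comparison against a cutoff
(no wraparound handling), which may well not be the right criterion.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MAX_ENTRIES 8

/* Hypothetical description of one state file found in pg_twophase. */
typedef struct SketchStateFile
{
    uint32_t    xid;
} SketchStateFile;

/* Hypothetical shared-memory entry, reduced to what the sketch needs. */
typedef struct SketchEntry
{
    uint32_t    xid;
    bool        inredo;
    bool        ondisk;
} SketchEntry;

typedef struct SketchState
{
    int         numEntries;
    SketchEntry entries[SKETCH_MAX_ENTRIES];
} SketchState;

/*
 * Run once at the beginning of recovery: stale files are discarded with a
 * WARNING, the others are loaded as inredo entries.  Returns the number of
 * files discarded.
 */
int
sketch_startup_scan(SketchState *state, const SketchStateFile *files,
                    int nfiles, uint32_t cutoff_xid)
{
    int         discarded = 0;

    for (int i = 0; i < nfiles; i++)
    {
        if (files[i].xid < cutoff_xid)  /* assumed staleness test */
        {
            fprintf(stderr, "WARNING: discarding stale two-phase file for xid %u\n",
                    (unsigned) files[i].xid);
            discarded++;
            continue;
        }
        if (state->numEntries >= SKETCH_MAX_ENTRIES)
            break;              /* out of slots; real code would error out */

        state->entries[state->numEntries++] =
            (SketchEntry) {files[i].xid, true, true};
    }
    return discarded;
}

/*
 * End-of-recovery pass: walks only the in-memory entries, never the
 * directory.  Returns the oldest prepared xid, or next_xid if none exist.
 */
uint32_t
sketch_prescan(const SketchState *state, uint32_t next_xid)
{
    uint32_t    oldest = next_xid;

    for (int i = 0; i < state->numEntries; i++)
    {
        if (state->entries[i].xid < oldest)
            oldest = state->entries[i].xid;
    }
    return oldest;
}
```

The point of the split is that the directory is read exactly once, at
startup; everything after that, including entries added by PREPARE records
replayed from WAL, goes through the same in-memory array.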

> I also managed to do some perf testing.
> Modified Stas' earlier scripts slightly:
> \set naccounts 100000 * :scale
> \set from_aid random(1, :naccounts)
> \set to_aid random(1, :naccounts)
> \set delta random(1, 100)
> \set scale :scale+1
> UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
> UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
> PREPARE TRANSACTION ':client_id.:scale';
> COMMIT PREPARED ':client_id.:scale';
> Created a base backup with scale factor 125 on an AWS t2.large instance. Set
> up archiving and did a 20 minute run with the above script saving the WALs
> in the archive.
> Then used recovery.conf to point to this WAL location and used the base
> backup to recover.
> With this patch applied: 20s
> Without patch: Stopped measuring after 5 minutes ;-)

And that's really nice.

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)