On 2016-01-11 19:39:14 +0000, Simon Riggs wrote:
> Currently, the patch reuses all of the code related to reading/write state
> files, so it is the minimal patch that can implement the important things
> for performance. The current patch succeeds in its goal to improve
> performance, so I personally see no further need for code churn.

Sorry, I don't buy that argument. This approach leaves us with a bunch
of code related to statefiles that's barely ever going to be exercised,
and leaves the performance bottleneck on WAL replay in place.

> As you suggest, we could also completely redesign the state file mechanism
> and/or put it in WAL at checkpoint. That's all very nice but is much more
> code and doesn't anything more for performance, since the current mainline
> path writes ZERO files at checkpoint.

Well, on the primary, yes.

> If you want that for some other reason or refactoring, I won't stop
> you, but its a separate patch for a separate purpose.

Maintainability/complexity very much has to be considered during review
and can't just be argued away with "but this is what we implemented".

> - *           In order to survive crashes and shutdowns, all prepared
> - *           transactions must be stored in permanent storage. This includes
> - *           locking information, pending notifications etc. All that state
> - *           information is written to the per-transaction state file in
> - *           the pg_twophase directory.
> + *           Information to recover prepared transactions in case of crash is
> + *           now stored in WAL for the common case. In some cases there will 
> be
> + *           an extended period between preparing a GXACT and commit/abort, 
> in

Absolutely minor: The previous lines were differently indented (two tabs
before, one space + two tabs now), which will probably mean pgindent
will yank all of it around, besides looking confusing with different tab

> *             * In case of crash replay will move data from xlog to files, if 
> that
> *               hasn't happened before. XXX TODO - move to shmem in replay 
> also

This is a bit confusing - causing my earlier confusion about using
XlogReadTwoPhaseData in recovery - because what this actually means is
that we get the data from normal WAL replay, not our new way of getting
things from the WAL.

> @@ -772,7 +769,7 @@ TwoPhaseGetGXact(TransactionId xid)
>        * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be 
> called
>        * repeatedly for the same XID.  We can save work with a simple cache.
>        */
> -     if (xid == cached_xid)
> +     if (xid == cached_xid && cached_gxact)
>               return cached_gxact;

What's that about? When can cached_xid be be equal xid and cached_gxact
not set? And why did that change with this patch?

> /*
>  * Finish preparing state file.
>  *
>  * Calculates CRC and writes state file to WAL and in pg_twophase directory.
>  */
> void
> EndPrepare(GlobalTransaction gxact)

In contrast to that comment we're not writing to pg_twophase anymore.

>       /*
>        * If the file size exceeds MaxAllocSize, we won't be able to read it in
>        * ReadTwoPhaseFile. Check for that now, rather than fail at commit 
> time.
>        */
>       if (hdr->total_len > MaxAllocSize)
>               ereport(ERROR,
>                               (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>                                errmsg("two-phase state file maximum length 
> exceeded")));

Outdated comment.

> +/*
> + * Reads 2PC data from xlog. During checkpoint this data will be moved to
> + * twophase files and ReadTwoPhaseFile should be used instead.
> + */
> +static void
> +XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
> +{
> +     XLogRecord *record;
> +     XLogReaderState *xlogreader;
> +     char       *errormsg;
> +
> +     xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
> +     if (!xlogreader)
> +             ereport(ERROR,
> +                             (errcode(ERRCODE_OUT_OF_MEMORY),
> +                              errmsg("out of memory"),
> +                              errdetail("Failed while allocating an XLog 
> reading processor.")));

Creating and deleting an xlogreader for every 2pc transaction isn't
particularly efficient. Reading the 2pc state from WAL will often also
mean hitting disk if there's significant WAL volume (we even hint that
we want the cache to be throw away for low wal_level!).

If we really go this way, we really need a) a comment here explaining
why timelines are never an issue b) an assert, preventing to be called
during recovery.

> +     record = XLogReadRecord(xlogreader, lsn, &errormsg);
> +     if (record == NULL ||
> +             XLogRecGetRmid(xlogreader) != RM_XACT_ID ||
> +             (XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != 
> +             ereport(ERROR,
> +                             (errcode_for_file_access(),
> +                              errmsg("could not read two-phase state from 
> xlog at %X/%X",
> +                                                     (uint32) (lsn >> 32),
> +                                                     (uint32) lsn)));

I think the record == NULL case should be handled separately (printing
->errormsg), and XLogRecGetRmid(xlogreader) != RM_XACT_ID &
(XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)
should get a more descriptive error message.

> /*
>  * Scan a 2PC state file (already read into memory by ReadTwoPhaseFile)
>  * and call the indicated callbacks for each 2PC record.
>  */
> static void
> ProcessRecords(char *bufptr, TransactionId xid,
>                          const TwoPhaseCallback callbacks[])

The data isn't neccesarily coming from a statefile anymore.

>  void
>  CheckPointTwoPhase(XLogRecPtr redo_horizon)
>  {
> -     TransactionId *xids;
> -     int                     nxids;
> -     char            path[MAXPGPATH];
>       int                     i;
> +     int                     n = 0;

Maybe also add a quick exit for when this is called during recovery?

> +     /*
> +      * We are expecting there to be zero GXACTs that need to be
> +      * copied to disk, so we perform all I/O while holding
> +      * TwoPhaseStateLock for simplicity. This prevents any new xacts
> +      * from preparing while this occurs, which shouldn't be a problem
> +      * since the presence of long-lived prepared xacts indicates the
> +      * transaction manager isn't active.

It's not *that* unlikely. Depending on settings the time between the
computation of the redo pointer and CheckPointTwoPhase() isn't
necessarily that large.

I wonder if we can address the replay performance issue significantly
enough by simply not fsyncing in RecreateTwoPhaseFile() during WAL
replay. If we make CheckPointTwoPhase() do that for the relevant 2pc
state files, we ought to be good, no?  Now that'd still not get close to
the performance on the primary (we do many more file creations!), but
it'd remove the most expensive part, the fsync.


Andres Freund

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to