On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> >   * Record the final state of transaction entries in the commit log for
> >   * all entries on a single page.  Atomic only on this page.
> >   *
> > + * Group the status update for transactions.  This improves the efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it.  To achieve the group transaction status
> > + * update, we need to populate the transaction status related information
> > + * in shared memory and doing it for overflowed sub-transactions would need
> > + * a big chunk of shared memory, so we are not doing this optimization for
> > + * such cases.  This optimization is only applicable if the transaction and
> > + * all child sub-transactions belong to same page which we presume to be the
> > + * most common case, we might be able to apply this when they are not on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache based
> > + * on pageno for which each time a group leader needs to set the transaction
> > + * status and that can lead to some performance penalty as well because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave that
> > + * case for now.  We don't do this optimization for prepared transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group status
> > + * update.  We choose not to create a semaphore for dummy procs for this
> > + * purpose as the advantage of using this optimization for prepared
> > + * transactions is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>
Okay, I will try to do so in the next version.

> I'm actually rather unconvinced that it's all that common that all
> subtransactions are on one page. If you have concurrency - otherwise
> there'd be not much point in this patch - they'll usually be heavily
> interleaved, no? You can argue that you don't care about subxacts,
> because they're more often used in less concurrent scenarios, but if
> that's the argument, it should actually be made.
>

Note that we are doing this only when a transaction has 64 or fewer
subtransactions.  I am not denying that there will be cases where the
subtransactions fall on different pages, but I think the chances of such
transactions participating in group mode are low, and this patch mainly
targets scalability for short transactions.

> >   * Otherwise API is same as TransactionIdSetTreeStatus()
> >   */
> >  static void
> >  TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> >                             TransactionId *subxids, XidStatus status,
> > -                           XLogRecPtr lsn, int pageno)
> > +                           XLogRecPtr lsn, int pageno,
> > +                           bool all_xact_same_page)
> > +{
> > +       /*
> > +        * If we can immediately acquire CLogControlLock, we update the status
> > +        * of our own XID and release the lock.  If not, use group XID status
> > +        * update to improve efficiency and if still not able to update, then
> > +        * acquire CLogControlLock and update it.
> > +        */
> > +       if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > +       {
> > +               TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +               LWLockRelease(CLogControlLock);
> > +       }
> > +       else if (!all_xact_same_page ||
> > +                nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > +                IsGXactActive() ||
> > +                !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> > +       {
> > +               LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +               TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +
> > +               LWLockRelease(CLogControlLock);
> > +       }
> > +}
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
>    PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
>    lock acquire first can be rather expensive.

The previous version (v5 - [1]) had the code that way, but that adds a
few extra instructions for the single-client case, and I was seeing a
minor performance regression for that case, which is why it was changed
to the current form.

> b) I'd rather see an explicit fallback for the
>    !TransactionGroupUpdateXidStatus case, this way it's too hard to
>    understand. It's also harder to add probes to detect whether that
>

Considering the above reply to (a), do you want to see it done as a
separate else if block in the patch?

> > +
> > +/*
> > + * When we cannot immediately acquire CLogControlLock in exclusive mode at
> > + * commit time, add ourselves to a list of processes that need their XIDs
> > + * status update.
>
> At this point my "ABA Problem" alarm goes off. If it's not an actual
> danger, can you please document close by, why not?
>

The reason this won't lead to an ABA problem is explained further down
in the comments.  Refer to:

+       /*
+        * Now that we've got the lock, clear the list of processes waiting for
+        * group XID status update, saving a pointer to the head of the list.
+        * Trying to pop elements one at a time could lead to an ABA problem.
+        */

> > + * The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform
> > + * TransactionIdSetPageStatusInternal on behalf of all group members.  This
> > + * avoids a great deal of contention around CLogControlLock when many
> > + * processes are trying to commit at once, since the lock need not be
> > + * repeatedly handed off from one committing process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > +                                XLogRecPtr lsn, int pageno)
> > +{
> > +       volatile PROC_HDR *procglobal = ProcGlobal;
> > +       PGPROC     *proc = MyProc;
> > +       uint32          nextidx;
> > +       uint32          wakeidx;
> > +       int                     extraWaits = -1;
> > +
> > +       /* We should definitely have an XID whose status needs to be updated. */
> > +       Assert(TransactionIdIsValid(xid));
> > +
> > +       /*
> > +        * Add ourselves to the list of processes needing a group XID status
> > +        * update.
> > +        */
> > +       proc->clogGroupMember = true;
> > +       proc->clogGroupMemberXid = xid;
> > +       proc->clogGroupMemberXidStatus = status;
> > +       proc->clogGroupMemberPage = pageno;
> > +       proc->clogGroupMemberLsn = lsn;
> > +       while (true)
> > +       {
> > +               nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > +               /*
> > +                * Add the proc to list, if the clog page where we need to update the
> > +                * current transaction status is same as group leader's clog page.
> > +                * There is a race condition here such that after doing the below
> > +                * check and before adding this proc's clog update to a group, if the
> > +                * group leader already finishes the group update for this page and
> > +                * becomes group leader of another group which updates different clog
> > +                * page, then it will lead to a situation where a single group can
> > +                * have different clog page updates.
> > +                * Now the chances of such a race
> > +                * condition are less and even if it happens, the only downside is
> > +                * that it could lead to serial access of clog pages from disk if
> > +                * those pages are not in memory.  Tests doesn't indicate any
> > +                * performance hit due to different clog page updates in same group,
> > +                * however in future, if we want to improve the situation, then we can
> > +                * detect the non-group leader transactions that tries to update the
> > +                * different CLOG page after acquiring CLogControlLock and then mark
> > +                * these transactions such that after waking they need to perform CLOG
> > +                * update via normal path.
> > +                */
>
> Needs a good portion of polishing.
>
> > +               if (nextidx != INVALID_PGPROCNO &&
> > +                       ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> > +                       return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>

I think it won't create a problem, but surely it is not good to leave it
set to true; I will change it in the next version of the patch.

> > +               pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > +               if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                  &nextidx,
> > +                                                  (uint32) proc->pgprocno))
> > +                       break;
> > +       }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>

It will not matter, because we wake the queued processes only once we
are done with the XID status update.

> FWIW, You can move the nextidx = part out of the loop,
> pg_atomic_compare_exchange will update the nextidx value from memory; no
> need for another load afterwards.
I am not sure I understood which statement you are referring to here
(do you mean the atomic read operation?), and how can we save the load
operation?

> > +       /*
> > +        * If the list was not empty, the leader will update the status of our
> > +        * XID.  It is impossible to have followers without a leader because the
> > +        * first process that has added itself to the list will always have
> > +        * nextidx as INVALID_PGPROCNO.
> > +        */
> > +       if (nextidx != INVALID_PGPROCNO)
> > +       {
> > +               /* Sleep until the leader updates our XID status. */
> > +               for (;;)
> > +               {
> > +                       /* acts as a read barrier */
> > +                       PGSemaphoreLock(&proc->sem);
> > +                       if (!proc->clogGroupMember)
> > +                               break;
> > +                       extraWaits++;
> > +               }
> > +
> > +               Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> > +
> > +               /* Fix semaphore count for any absorbed wakeups */
> > +               while (extraWaits-- > 0)
> > +                       PGSemaphoreUnlock(&proc->sem);
> > +               return true;
> > +       }
> > +
> > +       /* We are the leader.  Acquire the lock on behalf of everyone. */
> > +       LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +       /*
> > +        * Now that we've got the lock, clear the list of processes waiting for
> > +        * group XID status update, saving a pointer to the head of the list.
> > +        * Trying to pop elements one at a time could lead to an ABA problem.
> > +        */
> > +       while (true)
> > +       {
> > +               nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +               if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                  &nextidx,
> > +                                                  INVALID_PGPROCNO))
> > +                       break;
> > +       }
>
> Hm. It seems like you should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>

We need to remember the head of the list to wake up the processes, which
is why I think the above loop is required.
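For what it's worth, both atomics points under discussion can be modelled
outside PostgreSQL.  Below is a stand-alone, single-threaded sketch using
C11 atomics in place of pg_atomic_* (the names are illustrative, not the
patch's): compare-exchange rewrites its expected value on failure, so one
load before the push loop suffices, and a plain exchange already returns
the old head, so the detach needs no loop at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_PGPROCNO ((uint32_t) 0xFFFFFFFF)

/* Stand-in for procglobal->clogGroupFirst. */
static _Atomic uint32_t clogGroupFirst = INVALID_PGPROCNO;

/*
 * Push: one load before the loop is enough, because on failure
 * atomic_compare_exchange_weak() rewrites "nextidx" with the value it
 * found in memory -- no explicit re-read is needed.
 */
static uint32_t
push_self(uint32_t mypgprocno)
{
    uint32_t nextidx = atomic_load(&clogGroupFirst);

    while (!atomic_compare_exchange_weak(&clogGroupFirst, &nextidx,
                                         mypgprocno))
        ;                       /* nextidx already refreshed; just retry */
    return nextidx;             /* caller would store this in clogGroupNext */
}

/*
 * Detach: atomic_exchange() stores the new value and returns the old
 * head in one step, so the old head is still "remembered" without a
 * compare-exchange loop.
 */
static uint32_t
detach_all(void)
{
    return atomic_exchange(&clogGroupFirst, INVALID_PGPROCNO);
}
```

This only models the two primitives; the real code additionally has to
cope with the same-page check racing against a departing leader, which a
single-threaded sketch cannot show.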
> > diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
> > index c4fd9ef..120b9c0 100644
> > --- a/src/backend/access/transam/twophase.c
> > +++ b/src/backend/access/transam/twophase.c
> > @@ -177,7 +177,7 @@ static TwoPhaseStateData *TwoPhaseState;
> >  /*
> >   * Global transaction entry currently locked by us, if any.
> >   */
> > -static GlobalTransaction MyLockedGxact = NULL;
> > +GlobalTransaction MyLockedGxact = NULL;
>
> Hm, I'm doubtful it's worthwhile to expose this, just so we can use an
> inline function, but whatever.
>

I did it this way considering this to be a hot path, to save an
additional function call, but I can change it if you think so.

> > +#include "access/clog.h"
> >  #include "access/xlogdefs.h"
> >  #include "lib/ilist.h"
> >  #include "storage/latch.h"
> > @@ -154,6 +155,17 @@ struct PGPROC
> >
> >         uint32          wait_event_info;        /* proc's wait information */
> >
> > +       /* Support for group transaction status update. */
> > +       bool            clogGroupMember;        /* true, if member of clog group */
> > +       pg_atomic_uint32 clogGroupNext;         /* next clog group member */
> > +       TransactionId clogGroupMemberXid;       /* transaction id of clog group member */
> > +       XidStatus       clogGroupMemberXidStatus;       /* transaction status of clog
> > +                                                        * group member */
> > +       int                     clogGroupMemberPage;    /* clog page corresponding to
> > +                                                        * transaction id of clog group member */
> > +       XLogRecPtr      clogGroupMemberLsn;     /* WAL location of commit record for
> > +                                                        * clog group member */
> > +
>
> Man, we're surely bloating PGPROC at a prodigious rate.
>
>
> That's my first pass over the code itself.
>
>
> Hm. Details aside, what concerns me most is that the whole group
> mechanism, as implemented, only works as long as transactions only span
> a short and regular amount of time.
>

Yes, that's the main case targeted by this patch, and I think there are
many such cases in OLTP workloads, where transactions are very short.
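To put the PGPROC-bloat remark in concrete terms: taken in isolation,
the new fields come to about 32 bytes per backend on a typical LP64
platform.  A stand-alone model (the PostgreSQL typedefs are replaced by
illustrative stand-ins of the same width; the struct name is mine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the PostgreSQL typedefs (same widths). */
typedef uint32_t TransactionId;
typedef int      XidStatus;
typedef uint64_t XLogRecPtr;
typedef struct { volatile uint32_t value; } pg_atomic_uint32;

/* Just the fields the patch adds to struct PGPROC, in isolation. */
typedef struct ClogGroupFields
{
    bool             clogGroupMember;          /* member of clog group? */
    pg_atomic_uint32 clogGroupNext;            /* next clog group member */
    TransactionId    clogGroupMemberXid;       /* member's transaction id */
    XidStatus        clogGroupMemberXidStatus; /* member's status */
    int              clogGroupMemberPage;      /* member's clog page */
    XLogRecPtr       clogGroupMemberLsn;       /* commit record location */
} ClogGroupFields;
```

With 8-byte alignment for XLogRecPtr this pads out to 32 bytes, and that
is what gets multiplied by max_connections in shared memory.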
> If I understand correctly, without having followed the thread, the
> reason you came up with this batching on a per-page level is to bound
> the amount of effort spent by the leader; and thus bound the latency?
>

This is mainly to avoid the case where multiple pages are not in memory
and the leader needs to perform the I/O serially.  Refer to mail [2] for
the point raised by Robert.

> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>

I think it will happen that way, but what do we want to see with that
benchmark?  I think the result will be that, for such a workload, either
there is no benefit or it will be much smaller than for short
transactions.

[1] - http://www.postgresql.org/message-id/CAA4eK1KUVPxBcGTdOuKyvf5p1sQ0HeUbSMbTxtQc=p65oxi...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/CA+TgmoahCx6XgprR=p5==cf0g9uhshsjxvdwduehn9h2mv0...@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com