On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> >   * Record the final state of transaction entries in the commit log for
> >   * all entries on a single page.  Atomic only on this page.
> >   *
> > + * Group the status update for transactions.  This improves the efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it.  To achieve the group transaction status
> > + * update, we need to populate the transaction status related information
> > + * in shared memory and doing it for overflowed sub-transactions would need
> > + * a big chunk of shared memory, so we are not doing this optimization for
> > + * such cases.  This optimization is only applicable if the transaction and
> > + * all child sub-transactions belong to same page which we presume to be the
> > + * most common case, we might be able to apply this when they are not on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache based
> > + * on pageno for which each time a group leader needs to set the transaction
> > + * status and that can lead to some performance penalty as well because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave that
> > + * case for now.  We don't do this optimization for prepared transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group status
> > + * update.  We choose not to create a semaphore for dummy procs for this
> > + * purpose as the advantage of using this optimization for prepared
> > + * transactions is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>
Okay, I will try to do so in the next version.

> I'm actually rather unconvinced that it's all that common that all
> subtransactions are on one page. If you have concurrency - otherwise
> there'd be not much point in this patch - they'll usually be heavily
> interleaved, no? You can argue that you don't care about subxacts,
> because they're more often used in less concurrent scenarios, but if
> that's the argument, it should actually be made.
>

Note that we are doing this only when a transaction has 64 or fewer
subtransactions.  I am not denying that there will be cases where the
subtransactions fall on different pages, but I think the chances of such
transactions participating in group mode are low, and this patch mainly
targets scalability for short transactions.

> >   * Otherwise API is same as TransactionIdSetTreeStatus()
> >   */
> >  static void
> >  TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> >                             TransactionId *subxids, XidStatus status,
> > -                           XLogRecPtr lsn, int pageno)
> > +                           XLogRecPtr lsn, int pageno,
> > +                           bool all_xact_same_page)
> > +{
> > +       /*
> > +        * If we can immediately acquire CLogControlLock, we update the status
> > +        * of our own XID and release the lock.  If not, use group XID status
> > +        * update to improve efficiency and if still not able to update, then
> > +        * acquire CLogControlLock and update it.
> > +        */
> > +       if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > +       {
> > +               TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +               LWLockRelease(CLogControlLock);
> > +       }
> > +       else if (!all_xact_same_page ||
> > +                nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > +                IsGXactActive() ||
> > +                !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> > +       {
> > +               LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +               TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +
> > +               LWLockRelease(CLogControlLock);
> > +       }
> > +}
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
>    PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
>    lock acquire first can be rather expensive.

The previous version (v5 - [1]) had the code that way, but that adds a
few extra instructions for the single-client case, and I was seeing a
minor performance regression for that case, which is why it was changed
to the current form.

> b) I'd rather see an explicit fallback for the
>    !TransactionGroupUpdateXidStatus case, this way it's too hard to
>    understand. It's also harder to add probes to detect whether that
>

Considering the above reply to (a), do you want to see it done as a
separate else if block in the patch?

> > +
> > +/*
> > + * When we cannot immediately acquire CLogControlLock in exclusive mode at
> > + * commit time, add ourselves to a list of processes that need their XIDs
> > + * status update.
>
> At this point my "ABA Problem" alarm goes off. If it's not an actual
> danger, can you please document close by, why not?
>

The reason this won't lead to an ABA problem is explained further down
in the comments.  Refer to:

+       /*
+        * Now that we've got the lock, clear the list of processes waiting for
+        * group XID status update, saving a pointer to the head of the list.
+        * Trying to pop elements one at a time could lead to an ABA problem.
+        */

> > + * The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform
> > + * TransactionIdSetPageStatusInternal on behalf of all group members.  This
> > + * avoids a great deal of contention around CLogControlLock when many
> > + * processes are trying to commit at once, since the lock need not be
> > + * repeatedly handed off from one committing process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > +                                XLogRecPtr lsn, int pageno)
> > +{
> > +       volatile PROC_HDR *procglobal = ProcGlobal;
> > +       PGPROC     *proc = MyProc;
> > +       uint32          nextidx;
> > +       uint32          wakeidx;
> > +       int                     extraWaits = -1;
> > +
> > +       /* We should definitely have an XID whose status needs to be updated. */
> > +       Assert(TransactionIdIsValid(xid));
> > +
> > +       /*
> > +        * Add ourselves to the list of processes needing a group XID status
> > +        * update.
> > +        */
> > +       proc->clogGroupMember = true;
> > +       proc->clogGroupMemberXid = xid;
> > +       proc->clogGroupMemberXidStatus = status;
> > +       proc->clogGroupMemberPage = pageno;
> > +       proc->clogGroupMemberLsn = lsn;
> > +       while (true)
> > +       {
> > +               nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > +               /*
> > +                * Add the proc to list, if the clog page where we need to update the
> > +                * current transaction status is same as group leader's clog page.
> > +                * There is a race condition here such that after doing the below
> > +                * check and before adding this proc's clog update to a group, if the
> > +                * group leader already finishes the group update for this page and
> > +                * becomes group leader of another group which updates different clog
> > +                * page, then it will lead to a situation where a single group can
> > +                * have different clog page updates.
> > +                * Now the chances of such a race
> > +                * condition are less and even if it happens, the only downside is
> > +                * that it could lead to serial access of clog pages from disk if
> > +                * those pages are not in memory.  Tests doesn't indicate any
> > +                * performance hit due to different clog page updates in same group,
> > +                * however in future, if we want to improve the situation, then we can
> > +                * detect the non-group leader transactions that tries to update the
> > +                * different CLOG page after acquiring CLogControlLock and then mark
> > +                * these transactions such that after waking they need to perform CLOG
> > +                * update via normal path.
> > +                */
>
> Needs a good portion of polishing.
>
> > +               if (nextidx != INVALID_PGPROCNO &&
> > +                       ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> > +                       return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>

I think it won't create a problem, but surely it is not good to leave it
set to true; I will change it in the next version of the patch.

> > +               pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > +               if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                  &nextidx,
> > +                                                  (uint32) proc->pgprocno))
> > +                       break;
> > +       }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>

It will not matter, because we wake the queued processes only once we
are done with the XID status update.

> FWIW, You can move the nextidx = part out of the loop,
> pg_atomic_compare_exchange will update the nextidx value from memory; no
> need for another load afterwards.
I am not sure I understood which statement you are referring to here
(do you mean the atomic read operation?), and how can we save the load
operation?

> > +       /*
> > +        * If the list was not empty, the leader will update the status of our
> > +        * XID.  It is impossible to have followers without a leader because the
> > +        * first process that has added itself to the list will always have
> > +        * nextidx as INVALID_PGPROCNO.
> > +        */
> > +       if (nextidx != INVALID_PGPROCNO)
> > +       {
> > +               /* Sleep until the leader updates our XID status. */
> > +               for (;;)
> > +               {
> > +                       /* acts as a read barrier */
> > +                       PGSemaphoreLock(&proc->sem);
> > +                       if (!proc->clogGroupMember)
> > +                               break;
> > +                       extraWaits++;
> > +               }
> > +
> > +               Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> > +
> > +               /* Fix semaphore count for any absorbed wakeups */
> > +               while (extraWaits-- > 0)
> > +                       PGSemaphoreUnlock(&proc->sem);
> > +               return true;
> > +       }
> > +
> > +       /* We are the leader.  Acquire the lock on behalf of everyone. */
> > +       LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +       /*
> > +        * Now that we've got the lock, clear the list of processes waiting for
> > +        * group XID status update, saving a pointer to the head of the list.
> > +        * Trying to pop elements one at a time could lead to an ABA problem.
> > +        */
> > +       while (true)
> > +       {
> > +               nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +               if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                  &nextidx,
> > +                                                  INVALID_PGPROCNO))
> > +                       break;
> > +       }
>
> Hm. It seems like you should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>

We need to remember the head of the list to wake up the processes, which
is why I think the above loop is required.
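For what it's worth, both atomics points under discussion can be modelled
outside PostgreSQL.  Below is a stand-alone, single-threaded sketch using
C11 atomics in place of pg_atomic_* (the names are illustrative, not the
patch's): compare-exchange rewrites its expected value on failure, so one
load before the push loop suffices, and a plain exchange already returns
the old head, so the detach needs no loop at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_PGPROCNO ((uint32_t) 0xFFFFFFFF)

/* Stand-in for procglobal->clogGroupFirst. */
static _Atomic uint32_t clogGroupFirst = INVALID_PGPROCNO;

/*
 * Push: one load before the loop is enough, because on failure
 * atomic_compare_exchange_weak() rewrites "nextidx" with the value it
 * found in memory -- no explicit re-read is needed.
 */
static uint32_t
push_self(uint32_t mypgprocno)
{
    uint32_t nextidx = atomic_load(&clogGroupFirst);

    while (!atomic_compare_exchange_weak(&clogGroupFirst, &nextidx,
                                         mypgprocno))
        ;                       /* nextidx already refreshed; just retry */
    return nextidx;             /* caller would store this in clogGroupNext */
}

/*
 * Detach: atomic_exchange() stores the new value and returns the old
 * head in one step, so the old head is still "remembered" without a
 * compare-exchange loop.
 */
static uint32_t
detach_all(void)
{
    return atomic_exchange(&clogGroupFirst, INVALID_PGPROCNO);
}
```

This only models the two primitives; the real code additionally has to
cope with the same-page check racing against a departing leader, which a
single-threaded sketch cannot show.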
> > diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
> > index c4fd9ef..120b9c0 100644
> > --- a/src/backend/access/transam/twophase.c
> > +++ b/src/backend/access/transam/twophase.c
> > @@ -177,7 +177,7 @@ static TwoPhaseStateData *TwoPhaseState;
> >  /*
> >   * Global transaction entry currently locked by us, if any.
> >   */
> > -static GlobalTransaction MyLockedGxact = NULL;
> > +GlobalTransaction MyLockedGxact = NULL;
>
> Hm, I'm doubtful it's worthwhile to expose this, just so we can use an
> inline function, but whatever.
>

I did it this way considering this to be a hot path, to save an
additional function call, but I can change it if you think so.

> > +#include "access/clog.h"
> >  #include "access/xlogdefs.h"
> >  #include "lib/ilist.h"
> >  #include "storage/latch.h"
> > @@ -154,6 +155,17 @@ struct PGPROC
> >
> >         uint32          wait_event_info;        /* proc's wait information */
> >
> > +       /* Support for group transaction status update. */
> > +       bool            clogGroupMember;        /* true, if member of clog group */
> > +       pg_atomic_uint32 clogGroupNext;         /* next clog group member */
> > +       TransactionId clogGroupMemberXid;       /* transaction id of clog group member */
> > +       XidStatus       clogGroupMemberXidStatus;       /* transaction status of clog
> > +                                                        * group member */
> > +       int                     clogGroupMemberPage;    /* clog page corresponding to
> > +                                                        * transaction id of clog group member */
> > +       XLogRecPtr      clogGroupMemberLsn;     /* WAL location of commit record for
> > +                                                        * clog group member */
> > +
>
> Man, we're surely bloating PGPROC at a prodigious rate.
>
>
> That's my first pass over the code itself.
>
>
> Hm. Details aside, what concerns me most is that the whole group
> mechanism, as implemented, only works as long as transactions only span
> a short and regular amount of time.
>

Yes, that's the main case targeted by this patch, and I think there are
many such cases in OLTP workloads, where transactions are very short.
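To put the PGPROC-bloat remark in concrete terms: taken in isolation,
the new fields come to about 32 bytes per backend on a typical LP64
platform.  A stand-alone model (the PostgreSQL typedefs are replaced by
illustrative stand-ins of the same width; the struct name is mine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the PostgreSQL typedefs (same widths). */
typedef uint32_t TransactionId;
typedef int      XidStatus;
typedef uint64_t XLogRecPtr;
typedef struct { volatile uint32_t value; } pg_atomic_uint32;

/* Just the fields the patch adds to struct PGPROC, in isolation. */
typedef struct ClogGroupFields
{
    bool             clogGroupMember;          /* member of clog group? */
    pg_atomic_uint32 clogGroupNext;            /* next clog group member */
    TransactionId    clogGroupMemberXid;       /* member's transaction id */
    XidStatus        clogGroupMemberXidStatus; /* member's status */
    int              clogGroupMemberPage;      /* member's clog page */
    XLogRecPtr       clogGroupMemberLsn;       /* commit record location */
} ClogGroupFields;
```

With 8-byte alignment for XLogRecPtr this pads out to 32 bytes, and that
is what gets multiplied by max_connections in shared memory.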
> If I understand correctly, without having followed the thread, the
> reason you came up with this batching on a per-page level is to bound
> the amount of effort spent by the leader; and thus bound the latency?
>

This is mainly to avoid the case where multiple pages are not in memory
and the leader needs to perform the I/O serially.  Refer to mail [2] for
the point raised by Robert.

> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>

I think it will happen that way, but what do we want to see with that
benchmark?  I think the result will be that, for such a workload, either
there is no benefit or it will be much smaller than for short
transactions.

[1] - http://www.postgresql.org/message-id/CAA4eK1KUVPxBcGTdOuKyvf5p1sQ0HeUbSMbTxtQc=p65oxi...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/CA+TgmoahCx6XgprR=p5==cf0g9uhshsjxvdwduehn9h2mv0...@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com