[HACKERS] Group Commit

Heikki Linnakangas Thu, 29 Mar 2007 02:56:26 -0800

I've been working on the patch to enhance our group commit behavior. The patch is a dirty hack at the moment, but I'm settled on the algorithm I'm going to use and I know the issues involved.


Here's the patch as it is if you want to try it out:
http://community.enterprisedb.com/groupcommit-pghead-2.patch

but it needs a rewrite before being accepted. It'll only work on systems that use sysv semaphores: I needed to add a function to acquire a semaphore with timeout, and I only did it for sysv_sema.c for now.

What are the chances of getting this in 8.3, assuming that I rewrite and submit a patch within the next week or two?



Algorithm
---------

Instead of starting a WAL flush immediately after a commit record is inserted, we wait a while to give other backends a chance to finish their transactions and have them flushed by the same fsync call. There are two things we can control: how many commits to wait for (commit group size), and for how long (timeout).


We try to estimate the optimal commit group size. The estimate is

commit group size = (# of commit records flushed + # of commit records arrived while fsyncing).

This is a relatively simple estimate that works reasonably well with very short transactions, and the timeout limits the damage when the estimate is not working.

There are a lot more factors we could take into account in the estimate, for example:

- # of backends and their states (affects how many are likely to commit soon)

- amount of WAL written since last XLogFlush (affects the duration of fsync)

- when exactly the commit records arrive (we don't want to wait 10 ms to get one more commit record in, when an fsync takes 11 ms)


but I wanted to keep this simple for now.

The timeout is currently hard-coded at 1 ms. I wanted to keep it short compared to the time it takes to fsync (somewhere in the 5-15 ms range, depending on hardware), to limit the damage when the algorithm isn't getting the estimate right. We could also vary the timeout, but I'm not sure how to calculate the optimal value, and the real granularity will depend on the system anyhow.
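To make the two knobs concrete, here is a minimal sketch of the estimate and of the "stop waiting" condition. It is not code from the patch: the function names and the GROUP_COMMIT_TIMEOUT_US constant are made up for illustration, and the only values taken from the description above are the sum used for the estimate and the 1 ms timeout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant: the 1 ms hard-coded timeout, in microseconds. */
#define GROUP_COMMIT_TIMEOUT_US 1000L

/*
 * Estimate of the commit group size for the next round:
 * commits flushed by the last fsync plus commits that arrived
 * while that fsync was running.
 */
int
next_group_size(int flushed_commits, int arrived_during_fsync)
{
    assert(flushed_commits >= 0 && arrived_during_fsync >= 0);
    return flushed_commits + arrived_during_fsync;
}

/*
 * The flusher stops waiting as soon as either the expected number of
 * commit records has arrived or the timeout has expired.
 */
bool
should_flush_now(int pending_commits, int group_size, long waited_us)
{
    return pending_commits >= group_size ||
           waited_us >= GROUP_COMMIT_TIMEOUT_US;
}
```

With an estimate of 5, a flusher that has seen only 2 commit records after 0.5 ms keeps waiting, and gives up once the full 1 ms has passed.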


Implementation
--------------

To count the # of commits since last XLogFlush, I added a new XLogCtlCommit struct in shared memory:


typedef struct XLogCtlCommit
{
    slock_t    commit_lock;   /* protects the struct */
    int        commitCount;   /* # of commit records inserted since XLogFlush */
    int        groupSize;     /* current commit group size */
    XLogRecPtr lastCommitPtr; /* location of the latest commit record */
    PGPROC    *waiter;        /* process to signal when groupSize is reached */
} XLogCtlCommit;

Whenever a commit record is inserted in XLogInsert, commitCount is incremented and lastCommitPtr is updated.

When commitCount reaches groupSize, the waiter process is woken up.

In XLogFlush, after acquiring WALWriteLock, we wait until groupSize is reached (or timeout expires) before doing the flush.

Instead of the current logic to flush as much WAL as possible, we flush up to the last commit record. Flushing any more wouldn't save us an fsync later on, but might make the current fsync take longer. By doing that, we avoid the conditional acquire of the WALInsertLock that's in there currently. We make note of commitCount before starting the fsync; that's the # of commit records that arrived in time so that the fsync will flush them. Let's call that value "intime".

After the fsync is finished, we update the groupSize for the next round. The new groupSize is the current commitCount after the fsync, IOW the number of commit records that arrived after the previous XLogFlush, including the time it took to do the fsync. We then update commitCount by decrementing it by "intime".


Now we're ready for the next round, and we can release WALWriteLock.
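The bookkeeping around the fsync can be sketched as follows. This is a single-threaded illustration, not the patch itself: GroupCommitState is a cut-down stand-in for XLogCtlCommit, the function names are hypothetical, and the spinlock, the actual write/fsync, and the waiter wakeup are all left out.

```c
#include <assert.h>

/* Simplified stand-in for the counters in XLogCtlCommit (no locking). */
typedef struct GroupCommitState
{
    int commitCount;    /* # of commit records inserted since last flush */
    int groupSize;      /* commit group size to wait for next time */
} GroupCommitState;

/*
 * Called just before the fsync: remember how many commit records
 * arrived in time to be covered by it ("intime").
 */
int
begin_flush(const GroupCommitState *st)
{
    return st->commitCount;
}

/*
 * Called after the fsync: the new groupSize is everything that arrived
 * since the previous flush, including arrivals during the fsync; the
 * records the fsync covered are then subtracted from the counter.
 */
void
finish_flush(GroupCommitState *st, int intime)
{
    assert(intime >= 0 && intime <= st->commitCount);
    st->groupSize = st->commitCount;
    st->commitCount -= intime;
}
```

For example, if 3 commit records are pending when the fsync starts and 2 more arrive while it runs, the next round waits for a group of 5 and starts with 2 records already counted.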

WALWriteLock
------------

The above would work nicely, except that a normal lwlock doesn't cooperate. You can release and reacquire a lightweight lock in the same time slice even when there are other backends queuing for the lock, effectively cutting the queue.


Here's what sometimes happens, with 2 clients:

Client 1               Client 2
do work                do work
insert commit record   insert commit record
acquire WALWriteLock
                       try to acquire WALWriteLock, blocks
fsync
release WALWriteLock
begin new transaction
do work
insert commit record
reacquire WALWriteLock
wait for 2nd commit to arrive

Client 1 will eventually time out and flush just its own commit record. Client 2 should be released immediately after client 1 releases the WALWriteLock. It only needs to observe that its commit record has already been flushed and doesn't need to do anything.

To fix the above, and other race conditions like that, we need a specialized WALWriteLock that orders the waiters by the commit record XLogRecPtrs. WALWriteLockRelease wakes up all waiters that have their commit record already flushed. They will just fall through without acquiring the lock.
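A rough sketch of the release side of such a lock, under several assumptions: the Lsn type is a plain 64-bit stand-in for XLogRecPtr, the Waiter struct and release_flushed_waiters are hypothetical names, the queue is an array already sorted by commit record position, and "waking" is reduced to setting a flag instead of signalling a semaphore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;       /* stand-in for XLogRecPtr */

typedef struct Waiter
{
    Lsn commit_lsn;         /* position of this backend's commit record */
    int woken;              /* set when the backend may fall through */
} Waiter;

/*
 * The queue is ordered by ascending commit_lsn.  On release, every
 * waiter whose commit record is at or before flushed_upto is woken;
 * it never needs the lock.  Returns the number of waiters woken, which
 * is also the index of the first waiter that still has to flush.
 */
size_t
release_flushed_waiters(Waiter *queue, size_t n, Lsn flushed_upto)
{
    size_t i = 0;

    while (i < n && queue[i].commit_lsn <= flushed_upto)
    {
        /* The ordering is what lets us stop at the first unflushed waiter. */
        assert(i == 0 || queue[i - 1].commit_lsn <= queue[i].commit_lsn);
        queue[i].woken = 1;
        i++;
    }
    return i;
}
```

In the two-client scenario above, client 2's commit record was covered by client 1's fsync, so it would be woken on release and return without ever holding the lock.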


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
