Hi,

> On Wed, Aug 5, 2015 at 10:59 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> OK, committed.

I spent some time today reviewing the committed patch. So far my only
major complaint is that the comments insufficiently document the
approach taken: stuff like avoiding ABA-type problems by clearing the
list entirely, and it being impossible for entries to end up on the
list too early, absolutely needs to be documented explicitly.
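
To illustrate the ABA point: the leader detaches the entire list in
one atomic swap, so an entry it is about to process can never
simultaneously be part of a newly built list. A minimal sketch
(assuming 'procglobal' points to the PROC_HDR; the actual code may use
a compare-and-exchange loop instead):

        /*
         * Leader: atomically grab the whole wait list and reset the head.
         * Detaching and resetting happen in a single atomic operation, so a
         * backend that re-enqueues afterwards starts a fresh list; that is
         * what rules out ABA problems here.
         */
        nextidx = pg_atomic_exchange_u32(&procglobal->nextClearXidElem,
                                         INVALID_PGPROCNO);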

I think I found a few minor correctness issues, mostly around the fact
that we have, so far, tried to use semaphores in a way that copes with
unrelated unlocks "arriving". I actually think I removed all of the
locations that caused that to happen, but I don't think we should rely
on that fact. Looking at the following two pieces of code:
        /* If the list was not empty, the leader will clear our XID. */
        if (nextidx != INVALID_PGPROCNO)
        {
                /* Sleep until the leader clears our XID. */
                while (pg_atomic_read_u32(&proc->nextClearXidElem) != INVALID_PGPROCNO)
                {
                        extraWaits++;
                        PGSemaphoreLock(&proc->sem);
                }

                /* Fix semaphore count for any absorbed wakeups */
                while (extraWaits-- > 0)
                        PGSemaphoreUnlock(&proc->sem);
                return;
        }
...
        /*
         * Now that we've released the lock, go back and wake everybody up.  We
         * don't do this under the lock so as to keep lock hold times to a
         * minimum.  The system calls we need to perform to wake other
         * processes up are probably much slower than the simple memory writes
         * we did while holding the lock.
         */
        while (wakeidx != INVALID_PGPROCNO)
        {
                PGPROC  *proc = &allProcs[wakeidx];

                wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
                pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);

                if (proc != MyProc)
                        PGSemaphoreUnlock(&proc->sem);
        }

There's a bunch of issues with those two blocks afaics:

1) The first block (in one backend) might read INVALID_PGPROCNO before
   ever locking the semaphore, if the leader writes INVALID_PGPROCNO
   and PGSemaphoreUnlock()s quickly enough. The follower then never
   absorbs the wakeup, and the semaphore count is permanently out of
   "balance".

2) There are no memory barriers around dealing with nextClearXidElem
   in the first block. Afaics there needs to be a read barrier before
   returning; otherwise it's e.g. not guaranteed that the woken-up
   backend sees its own xid set to InvalidTransactionId.

3) If a follower returns before the leader has actually finished waking
   the respective backend up, we can get into trouble:

   Consider what happens if such a follower enqueues in another
   transaction. It is not, as far as I could find out, guaranteed on all
   types of cpus that a third backend can actually see nextClearXidElem
   as INVALID_PGPROCNO. That'd likely require SMT/HT cores and multiple
   sockets. If the write to nextClearXidElem is entered into the local
   store buffer (leader #1) a hyper-threaded process (leader #2) can
   possibly see it (store forwarding) while another backend doesn't
   yet.

   I think this is very unlikely to be an actual problem, due to
   independent barriers until enqueued again, but I don't want to rely
   on it without documenting it. It seems safer to replace
   +            wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
   +            pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);
   with a pg_atomic_exchange_u32().
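
   That is, roughly (a sketch using the existing atomics API, which
   fetches the old value and resets the field in a single atomic step):

   +            wakeidx = pg_atomic_exchange_u32(&proc->nextClearXidElem,
   +                                             INVALID_PGPROCNO);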


I think to fix these ProcArrayGroupClearXid() should use a protocol
similar to lwlock.c's. E.g. make the two blocks something like:
        while (wakeidx != INVALID_PGPROCNO)
        {
                PGPROC  *proc = &allProcs[wakeidx];

                wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
                pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);

                /*
                 * Ensure that all previous writes are visible before the
                 * follower continues.
                 */
                pg_write_barrier();

                proc->lwWaiting = false;

                if (proc != MyProc)
                        PGSemaphoreUnlock(&proc->sem);
        }
and
        if (nextidx != INVALID_PGPROCNO)
        {
                Assert(MyProc->lwWaiting);      /* set by the enqueue path */

                for (;;)
                {
                        /* acts as a read barrier */
                        PGSemaphoreLock(&MyProc->sem);
                        if (!MyProc->lwWaiting)
                                break;
                        extraWaits++;
                }

                Assert(pg_atomic_read_u32(&proc->nextClearXidElem) == INVALID_PGPROCNO);

                /* Fix semaphore count for any absorbed wakeups */
                while (extraWaits-- > 0)
                        PGSemaphoreUnlock(&proc->sem);
                return;
        }
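
For this to work the enqueue path would, of course, have to set
MyProc->lwWaiting = true before publishing itself on the list.
Roughly, under that assumption (again with 'procglobal' pointing to
the PROC_HDR):

        /* mark ourselves waiting before becoming visible on the list */
        MyProc->lwWaiting = true;

        while (true)
        {
                nextidx = pg_atomic_read_u32(&procglobal->nextClearXidElem);
                pg_atomic_write_u32(&MyProc->nextClearXidElem, nextidx);

                /* try to swing the list head to ourselves */
                if (pg_atomic_compare_exchange_u32(&procglobal->nextClearXidElem,
                                                   &nextidx,
                                                   (uint32) MyProc->pgprocno))
                        break;
        }

That way the leader's lwWaiting = false store, made visible by the
write barrier, is the only thing that can release the follower, and
stray semaphore wakeups are harmless.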

Going through the patch:

+/*
+ * ProcArrayGroupClearXid -- group XID clearing
+ *
+ * When we cannot immediately acquire ProcArrayLock in exclusive mode at
+ * commit time, add ourselves to a list of processes that need their XIDs
+ * cleared.  The first process to add itself to the list will acquire
+ * ProcArrayLock in exclusive mode and perform ProcArrayEndTransactionInternal
+ * on behalf of all group members.  This avoids a great deal of context
+ * switching when many processes are trying to commit at once, since the lock
+ * only needs to be handed from the last share-locker to one process waiting
+ * for the exclusive lock, rather than to each one in turn.
+ */
+static void
+ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
+{

This comment, in my opinion, is rather misleading. If you have a
workload that is primarily slowed down by transaction commits, this
patch doesn't actually change the number of context switches very
much: previously all backends enqueued in the lwlock and got woken up
one-by-one. Save for backends 'jumping' the queue while the lock has
been released but the woken-up backend isn't running yet, there'll be
exactly as many context switches as today.

The difference is that only one backend has to actually acquire the
lock. So what has changed is the number of times, and the total
duration for which, the lock is actually held in exclusive mode.

+       /* Support for group XID clearing. */
+       volatile pg_atomic_uint32       nextClearXidElem;

+       /* First pgproc waiting for group XID clear */
+       volatile pg_atomic_uint32 nextClearXidElem;

Superfluous volatiles.

I don't think it's a good idea to use the same variable name in
PROC_HDR and PGPROC; it's confusing.
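
E.g. something like (names purely illustrative):

        /* PROC_HDR: first pgproc waiting for group XID clearing */
        pg_atomic_uint32 firstClearXidElem;

        /* PGPROC: next member of the group XID clearing list */
        pg_atomic_uint32 nextClearXidElem;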



How hard did you try checking whether this causes regressions? This
increases the number of atomics in the commit path a fair bit. I doubt
it's really bad, but it seems like a good idea to benchmark something
like a single full-throttle writer and a large number of readers.
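
E.g. something like this, with write.sql standing in for a small
single-row update script (a hypothetical setup, scale to taste):

    # one unthrottled writer ...
    pgbench -n -c 1 -j 1 -T 60 -f write.sql &

    # ... against many concurrent read-only clients
    pgbench -n -c 64 -j 16 -T 60 -S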

Greetings,

Andres Freund

