On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <f...@phlo.org> wrote:
> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>> So, the majority (60%) of the excess spinning appears to be due to
>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>
> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
> so I guess that two adjacent LWLocks currently share one cache line.
>
> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
> index 5, so if I'm not mistaken exactly the two locks where you saw
> the largest contention on are on the same cache line...
>
> Might make sense to try and see if these numbers change if you
> either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit.  It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

SInvalReadLock is a good example of a lock which seems barely to be
necessary at all.  Except on extraordinarily rare occasions, it is
only ever taken in share mode.  Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy.  I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether.  I have a feeling if I
were 25% smarter I could do it, but I'm not.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read.  We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing.  I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written.  What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read.  When a message is written, the global counter is bumped.  When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.

There is an obvious (potential) memory-ordering problem here.  If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen.  Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock.  It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before.  As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good.  On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic.  In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough.  We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces).  When a backend compares its counter to the
global counter, it does so while holding this lock.  When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1).  So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.

If anyone has read this far, you have my undying gratitude....
especially if you respond with comments.

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes.  I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to