On Sat, Aug 17, 2013 at 10:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
>> *) What I think is happening:
>> I think we are again getting burned by getting de-scheduled while
>> holding the free list lock. I've been chasing this problem for a long
>> time now (for example, see:
>> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
>> but now I've got a reproducible case.  What is happening is this:
>>
>> 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
>> 2. call ReadBufferBI.  This goes down the chain until StrategyGetBuffer()
>> 3. lock the free list, go into the clock sweep loop
>> 4. while still holding the free list lock in the clock sweep, hit a
>> 'hot' buffer, spin on its header lock
>> 5. get de-scheduled
>> 6. now enter the 'hot buffer spin lock lottery'
>> 7. more and more backends pile on, the Linux scheduler goes berserk,
>> reducing the chances of winning #6
>> 8. finally win the lottery.  The lock is released and everything goes
>> back to normal.
>
> This is an interesting theory, but where's the evidence?  I've seen
> spinlock contention come from enough different places to be wary of
> arguments that start with "it must be happening because...".

Absolutely.  My evidence is circumstantial at best -- let's call it a
hunch.  I also do not think we are facing pure spinlock contention,
but something more complex: a combination of spinlocks, the free list
LWLock, and the Linux scheduler.  This problem showed up going from
RHEL 5 to 6, which brought a lot of scheduler changes.  A lot of other
things changed too, but the high sys CPU really suggests we are
getting some feedback from the scheduler.
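To make the hunch concrete, here is a toy, self-contained C model of
the shape of that code path.  It is emphatically not the real bufmgr
code: the struct, NBUFFERS, and the lock names are made up, with
freelist_lock standing in for the BufFreelistLock LWLock and hdr_lock
for the per-buffer header spinlock.  The point is only to show that
the per-buffer lock is taken while the free list lock is still held,
so a stall in the wrong place backs up every allocating backend.

/*
 * Toy model of the code path I'm suspicious of.  NOT the real bufmgr
 * code; everything here is a stand-in for illustration.  Compile with
 * -pthread.
 */
#include <pthread.h>

#define NBUFFERS 1024

typedef struct
{
    pthread_spinlock_t hdr_lock;    /* stand-in for buffer header spinlock */
    int         refcount;
    int         usage_count;
} ToyBufferDesc;

static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static ToyBufferDesc buffers[NBUFFERS];
static int  next_victim = 0;

static ToyBufferDesc *
toy_StrategyGetBuffer(void)
{
    pthread_mutex_lock(&freelist_lock);     /* step 3: lock the free list */

    for (;;)
    {
        ToyBufferDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        /*
         * Step 4: still holding freelist_lock, take the per-buffer
         * spinlock.  If whoever holds hdr_lock on a 'hot' buffer has
         * been descheduled, we spin right here and every other
         * allocating backend queues up behind freelist_lock.
         */
        pthread_spin_lock(&buf->hdr_lock);

        if (buf->refcount == 0 && buf->usage_count == 0)
        {
            /* found a victim: hand it back still header-locked */
            pthread_mutex_unlock(&freelist_lock);
            return buf;
        }

        if (buf->usage_count > 0)
            buf->usage_count--;             /* clock sweep tick */

        pthread_spin_unlock(&buf->hdr_lock);
    }
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        pthread_spin_init(&buffers[i].hdr_lock, PTHREAD_PROCESS_PRIVATE);

    ToyBufferDesc *victim = toy_StrategyGetBuffer();

    pthread_spin_unlock(&victim->hdr_lock);
    return 0;
}

In the real server the spin on a contended buffer header eventually
backs off into a sleep (s_lock's delay loop), which is exactly the
point where I suspect the scheduler starts feeding back into the
problem.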

> IMHO, the thing to do here is run perf record -g during one of the
> trouble periods.  The performance impact is quite low.  You could
> probably even set up a script that runs perf for five minute intervals
> at a time and saves all of the perf.data files.  When one of these
> spikes happens, grab the one that's relevant.

Unfortunately, that's not on the table.  Dropping shared_buffers to
2GB (thanks RhodiumToad) seems to have fixed the issue, and there is
zero chance I will get approval to revert that setting in order to
force this to re-appear.  So far, I have not been able to reproduce it
in testing.  By the way, this problem has popped up in other places
too, and the typical remedies get applied until it goes away :(.

> If you see that s_lock is where all the time is going, then you've
> proved it's a PostgreSQL spinlock rather than something in the kernel
> or a shared library.  If you can further see what's calling s_lock
> (which should hopefully be possible with perf -g), then you've got it
> nailed dead to rights.

Well, I don't think it's that simple.  So my plan of action is this:
1) Improvise a patch that removes one *possible* trigger for the
problem, or at least makes it much less likely to occur (a rough
sketch of the idea is at the end of this mail).  Also, in real-world
cases where usage_count is examined N times before a candidate buffer
is returned, the amount of overall spinlocking from buffer allocation
is reduced by approximately (N-1)/N.  Even though spinlocking is
cheap, it's hard to argue with that...

2) Exhaustively performance-test patch #1.  I think this is win-win,
since the StrategyGetBuffer clock sweep loop is, quite frankly,
relatively unoptimized.  I don't see how reducing the amount of
locking could hurt performance, but I've been, uh, wrong about these
types of things before.

3) If #2 shows a general benefit without downside, I'll simply
advance the patch for the next CF and see how things shake out.  If
and when I feel like there's a decent shot at it getting accepted, I
may go through the motions of setting up a patched server in
production and raising shared_buffers again.  But that's a long way
off.
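
For the curious, here is roughly the kind of change I have in mind
for #1, sketched against the toy model from earlier in this mail
rather than against bufmgr.c (the real patch has to be expressed in
terms of BufferDesc, BufFreelistLock, and friends, of course).  The
idea: peek at usage_count without taking the header spinlock, and
only lock a buffer that already looks like a victim so the decision
can be re-checked under the lock.  If the sweep inspects N buffers
before finding one, header spinlock acquisitions drop from about N to
about 1, which is where the (N-1)/N figure above comes from.

/*
 * Sketch only: same toy model as before, but the sweep does a dirty
 * (unlocked) read of refcount/usage_count and only takes the header
 * spinlock on a buffer that already looks free, re-checking under the
 * lock before returning it.
 */
static ToyBufferDesc *
toy_StrategyGetBuffer_patched(void)
{
    pthread_mutex_lock(&freelist_lock);

    for (;;)
    {
        ToyBufferDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        /* unlocked peek: possibly stale, but no spinlock traffic */
        if (buf->refcount == 0 && buf->usage_count == 0)
        {
            pthread_spin_lock(&buf->hdr_lock);

            /* recheck now that we hold the header lock */
            if (buf->refcount == 0 && buf->usage_count == 0)
            {
                pthread_mutex_unlock(&freelist_lock);
                return buf;         /* still header-locked, as before */
            }
            pthread_spin_unlock(&buf->hdr_lock);
        }
        else if (buf->usage_count > 0)
        {
            buf->usage_count--;     /* clock tick without the spinlock */
        }
    }
}

The dirty read and the unlocked decrement are the parts that need
defending; whether a stale usage_count or an occasionally lost
decrement matters in practice is exactly what the testing in #2 has
to answer.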

merlin


