My $company recently acquired another Postgres-based $company and
migrated all their server operations into our datacenter.  Upon
completing the move, the newly migrated database server started
experiencing huge load spikes.


*) Environment description:
Postgres 9.2.4
RHEL 6
32 cores
virtualized (ESX) but with a dedicated host
256GB RAM
shared_buffers: 32GB
96 application servers, each configured for a max of 5 connections
very fast I/O
database size: ~200GB
HS/SR: 3 slaves

*) Problem description:
The server normally hums along nicely with load < 1.0 and no iowait --
in fact the server is massively over-provisioned.  However, on a
semi-random basis (once every 1-2 days) load absolutely goes through
the roof to 600+, with no iowait and 90-100% CPU (70%+ sys).  It hangs
around like that for 5-20 minutes and then resolves as suddenly as it
started.  There is nothing interesting going on application side
(except that the application servers are all piling on), but pg_locks
records lots of contention on relation extension locks.  One
interesting point is that the slaves are also affected, but the high
load hits them a few seconds after it hits the master.

*) Initial steps taken:
RhodiumToad (aka Andrew G) has seen this in the wild several times and
suggested that dropping shared_buffers significantly might resolve the
situation in the short term.  That was done on Friday night, and so
far the problem has not recurred.

*) What I think is happening:
I think we are again getting burned by getting de-scheduled while
holding the free list lock.  I've been chasing this problem for a long
time now (for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case.  What is happening is this (a
simplified model follows the list):

1. In RelationGetBufferForTuple (hio.c): fire LockRelationForExtension.
2. Call ReadBufferBI.  This goes down the chain until StrategyGetBuffer().
3. Lock the free list, go into the clock sweep loop.
4. While still holding the free list lock, the sweep hits a 'hot'
buffer and spins on its header lock.
5. Get de-scheduled.
6. Now enter the 'hot buffer spinlock lottery'.
7. More and more backends pile on, the Linux scheduler goes berserk,
reducing the chances of winning #6.
8. Finally win the lottery.  Lock released.  Everything back to normal.
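
To make the locking structure in steps 3-5 concrete, here is a tiny
standalone model of the 9.2-style victim search.  This is not the real
bufmgr/freelist code: BufDesc, Spinlock, freelist_lock and friends are
simplified stand-ins, and the LWLock is approximated with a pthread
mutex.  The point is only the nesting: the per-buffer header spinlock
is taken unconditionally inside the sweep while the free list lock is
still held, so one badly timed de-schedule stalls every allocating
backend.

/* compile with: cc -std=c11 -pthread model.c */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_flag locked; } Spinlock;

static void
spin_acquire(Spinlock *s)
{
    /* steps 4/6: contending backends burn CPU (and yield) right here */
    while (atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire))
        sched_yield();
}

static void
spin_release(Spinlock *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}

typedef struct
{
    Spinlock    hdr_lock;       /* per-buffer header spinlock */
    int         refcount;       /* pin count */
    int         usage_count;    /* clock-sweep usage count */
} BufDesc;

#define NBUFFERS 1024           /* tiny; think millions for 32GB shared_buffers */
static BufDesc         buffers[NBUFFERS];
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static int             next_victim;    /* clock hand, guarded by freelist_lock */

static BufDesc *
get_victim_buffer(void)
{
    pthread_mutex_lock(&freelist_lock); /* step 3: all allocation
                                         * serializes on this lock */
    for (;;)
    {
        BufDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        /*
         * Step 4: take the header lock unconditionally.  If this buffer
         * is hot (constantly pinned/unpinned), we spin here while still
         * holding freelist_lock; if the kernel de-schedules us now
         * (step 5), every backend that needs a buffer queues up behind us.
         */
        spin_acquire(&buf->hdr_lock);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;     /* decremented under the lock */
            else
            {
                pthread_mutex_unlock(&freelist_lock);
                return buf;             /* caller releases hdr_lock */
            }
        }
        spin_release(&buf->hdr_lock);
    }
}

int
main(void)
{
    (void) get_victim_buffer();         /* smoke test only */
    return 0;
}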

*) What I would like to do to fix it:
See the attached patch.  This builds on the work of Jeff Janes to
remove the free list lock and adds some extra optimizations in the
clock sweep loop:

optimization 1: the usage count is advisory; it is not updated behind
the buffer header lock.  In the event there is a large sequence of
buffers with usage_count > 0, this avoids spamming the lock's cache
line; you decrement and hope for the best.
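
In terms of the toy model above (and only as a sketch of the idea --
the actual change is in the attached patch), the decay step becomes an
unlocked, best-effort decrement:

/* optimization 1, sketched against the model: decay usage_count without
 * taking hdr_lock.  A racing backend may clobber the decrement; that is
 * accepted, since the count is only advisory. */
static inline int
sweep_decay_usage(BufDesc *buf)
{
    if (buf->usage_count > 0)
    {
        buf->usage_count--;     /* unlocked, possibly lost update */
        return 1;               /* keep sweeping */
    }
    return 0;                   /* candidate for the locked checks */
}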

optimization 2: refcount is examined during buffer allocation without
taking a lock.  If it's > 0, the buffer is assumed pinned (even though
it may not in fact be) and the sweep continues.
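
Again sketched against the model, the pin check is just a dirty read;
a stale value only costs a skipped buffer or a recheck later under the
header lock:

/* optimization 2, model version: peek at refcount with no lock held */
static inline int
sweep_looks_pinned(const BufDesc *buf)
{
    return buf->refcount > 0;   /* unlocked read; treat nonzero as pinned */
}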

optimization 3: the sweep does not wait on the buffer header lock.
Instead, it does a 'try lock' and bails if the buffer is determined to
be pinned.  I believe this to be one of the two critical optimizations.
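
A model-level sketch, assuming a non-blocking spin_tryacquire() helper
(hypothetical; added here just for illustration) on the Spinlock type
from the first sketch:

/* optimization 3: never spin on a header lock during the sweep */
static int
spin_tryacquire(Spinlock *s)
{
    return !atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire);
}

static BufDesc *
sweep_try_claim(BufDesc *buf)
{
    if (!spin_tryacquire(&buf->hdr_lock))
        return NULL;            /* someone holds the header lock: move on */
    if (buf->refcount == 0 && buf->usage_count == 0)
        return buf;             /* claimed; caller releases hdr_lock */
    spin_release(&buf->hdr_lock);
    return NULL;
}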

optimization 4: remove the free list lock (via Jeff Janes).  This is
the other critical optimization: one backend will no longer be able to
shut down buffer allocation.
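
Putting the pieces together against the same model: the free list lock
disappears and the clock hand becomes a shared atomic counter, so no
single backend can stall allocation for everyone else.  This is only an
illustration of the shape of the change, not the patch's actual code:

/* optimization 4, model version: lock-free clock hand plus the helpers
 * from the sketches above (NBUFFERS here is a power of two, so the
 * unsigned wraparound of the counter is harmless) */
static atomic_uint clock_hand;

static BufDesc *
get_victim_buffer_lockfree(void)
{
    for (;;)
    {
        unsigned    pos = atomic_fetch_add(&clock_hand, 1) % NBUFFERS;
        BufDesc    *buf = &buffers[pos];

        if (sweep_looks_pinned(buf))    /* optimization 2 */
            continue;
        if (sweep_decay_usage(buf))     /* optimization 1 */
            continue;
        buf = sweep_try_claim(buf);     /* optimization 3 */
        if (buf != NULL)
            return buf;                 /* hdr_lock held; caller releases */
    }
}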

*) What I'm asking for:

Are the analysis and the patch to fix the perceived problem plausible
without breaking other stuff?  If so, I'm inclined to go further with
this.  This is not the only solution on the table for high buffer
contention, but IMNSHO it should get a lot of points for being very
localized.  Maybe a reduced version could be tried, retaining the
freelist lock but keeping the 'try lock' on the buffer header.

*) Further reading:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html

http://www.postgresql.org/message-id/cahyxu0x47d4n6edpynyadshxqqxkohelv2cbrgr_2ngrc8k...@mail.gmail.com

http://postgresql.1045698.n5.nabble.com/Page-replacement-algorithm-in-buffer-cache-td5749236.html


merlin

Attachment: buffer2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
