On Mon, Aug 5, 2013 at 8:49 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
>
> *) What I think is happening:
> I think we are again getting burned by getting de-scheduled while
> holding the free list lock. I've been chasing this problem for a long
> time now (for example, see:
> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
> but now I've got a reproducible case.  What is happening is this:
>
> 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
> 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
> 3. Lock free list, go into clock sweep loop
> 4. while holding the free list lock in the clock sweep, hit a 'hot' buffer, spin on it
> 5. get de-scheduled

It seems more likely to me that it got descheduled immediately after
obtaining the hot buffer header spinlock but before releasing it,
rather than while still spinning for it.


> 6. now enter the 'hot buffer spin lock lottery'
> 7. more and more backends pile on, the Linux scheduler goes berserk,
> reducing chances of winning #6

Lots and lots of processes (which want to use the buffer, not to evict
it) are spinning on the buffer-header lock held by the descheduled
process, and the scheduler goes berserk.  A bunch of other processes
are waiting on the relation extension lock, but they do so
non-berserkly, on the semaphore.

Each spinning process hits spins_per_delay and puts itself to sleep
without having consumed its time slice, so the scheduler keeps
rescheduling those processes rather than the one actually holding the
lock, which did consume its entire slice and is therefore low on the
scheduler's list to be rescheduled.

That is my theory, of course.  I'm not sure whether it leads to a
different course of action than the theory that the process had not
yet obtained the header lock.
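
To make that dynamic concrete, here is a throwaway toy (ordinary
threaded C, not PostgreSQL source; the constants are just what I
remember of the s_lock.c defaults, 100 spins and a 1 ms minimum
sleep).  Each waiter gives the CPU back voluntarily after a bounded
number of spins, while a preempted holder has to wait for its next
turn:

/*
 * spin_demo.c -- standalone toy, NOT PostgreSQL source.
 *
 * Eight threads fight over one "buffer header" flag.  Each waiter spins
 * a bounded number of times and then sleeps voluntarily, so it goes back
 * on the run queue without having burned its whole time slice.
 *
 * Build: gcc -std=c11 -pthread spin_demo.c -o spin_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS        8
#define SPINS_PER_DELAY 100     /* spins before giving up and sleeping */
#define DELAY_USEC      1000    /* 1 ms sleep once out of spins */

static atomic_flag hdr_lock = ATOMIC_FLAG_INIT; /* the "buffer header" lock */
static atomic_long sleeps = 0;                  /* times a waiter had to sleep */

static void
toy_spin_lock(void)
{
    for (;;)
    {
        for (int i = 0; i < SPINS_PER_DELAY; i++)
        {
            if (!atomic_flag_test_and_set_explicit(&hdr_lock,
                                                   memory_order_acquire))
                return;         /* got the lock */
        }
        /* Out of spins: yield the CPU rather than burning the slice. */
        atomic_fetch_add(&sleeps, 1);
        usleep(DELAY_USEC);
    }
}

static void
toy_spin_unlock(void)
{
    atomic_flag_clear_explicit(&hdr_lock, memory_order_release);
}

static void *
worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < 100000; i++)
    {
        toy_spin_lock();
        /* pretend to inspect/update the buffer header */
        toy_spin_unlock();
    }
    return NULL;
}

int
main(void)
{
    pthread_t th[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    printf("waiters slept %ld times\n", atomic_load(&sleeps));
    return 0;
}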

Any idea what spins_per_delay has converged to?  There seems to be no
way to obtain this info other than firing up the debugger.  I have a
patch somewhere that makes every process elog(LOG, ...) the value it
fetches from shared memory immediately upon start-up, but it is a pain
to hunt down the patch, apply it, and recompile and restart the server
every time I want this value.  Maybe something like this could be put
in core under a suitable condition, but I don't know what that
condition would be.  Or would a C function, with a
pg_spins_per_delay() SQL wrapper that returns the value on demand, be
the way to go?
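
Something like the following is what I have in mind, sketched from
memory and untested.  It assumes the shared-memory copy,
ProcGlobal->spins_per_delay, is the interesting one; the per-backend
local value in s_lock.c isn't reachable without exporting it:

/* pg_spins_per_delay.c -- rough, untested sketch of such a wrapper */
#include "postgres.h"
#include "fmgr.h"
#include "storage/proc.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_spins_per_delay);

Datum
pg_spins_per_delay(PG_FUNCTION_ARGS)
{
    /* the value backends copy from shared memory at start-up */
    PG_RETURN_INT32(ProcGlobal->spins_per_delay);
}

plus the SQL-level wrapper:

CREATE FUNCTION pg_spins_per_delay() RETURNS int4
    AS '$libdir/pg_spins_per_delay' LANGUAGE C STRICT;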



> Are the analysis and the patch to fix the perceived problem plausible,
> without breaking other stuff?  If so, I'm inclined to go further with
> this.  This is not the only solution on the table for high buffer
> contention, but IMNSHO it should get a lot of points for being very
> localized.  Maybe a reduced version could be tried, retaining the
> freelist lock but keeping the 'trylock' on the buf header.
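
If I understand the reduced version correctly, the sweep would do a
conditional acquire on each buffer header and simply skip any buffer
whose header lock it cannot get immediately, instead of spinning on it
while holding the free list lock.  In toy form (plain standalone C,
not a patch against freelist.c; all names and counts are made up):

/*
 * trylock_sweep.c -- toy illustration only, not the real clock sweep.
 *
 * Build: gcc -std=c11 trylock_sweep.c -o trylock_sweep
 */
#include <stdatomic.h>
#include <stdio.h>

#define NBUFFERS 16

static atomic_flag buf_hdr_lock[NBUFFERS];  /* one "header lock" per buffer */
static int usage_count[NBUFFERS];

/* Returns the index of a victim buffer, or -1 if every candidate was busy. */
static int
clock_sweep_trylock(void)
{
    static int next_victim = 0;

    for (int tries = 0; tries < NBUFFERS * 2; tries++)
    {
        int b = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;

        /* Conditional acquire: skip the buffer instead of spinning on it. */
        if (atomic_flag_test_and_set_explicit(&buf_hdr_lock[b],
                                              memory_order_acquire))
            continue;               /* header lock busy; move on */

        if (usage_count[b] == 0)
            return b;               /* victim found; caller unlocks it */

        usage_count[b]--;           /* still "hot": decay and keep sweeping */
        atomic_flag_clear_explicit(&buf_hdr_lock[b], memory_order_release);
    }
    return -1;                      /* nothing evictable this pass */
}

int
main(void)
{
    int victim;

    for (int i = 0; i < NBUFFERS; i++)
    {
        atomic_flag_clear(&buf_hdr_lock[i]);    /* start unlocked */
        usage_count[i] = i % 3;                 /* fake usage counts */
    }

    victim = clock_sweep_trylock();
    printf("victim buffer: %d\n", victim);
    if (victim >= 0)
        atomic_flag_clear_explicit(&buf_hdr_lock[victim], memory_order_release);
    return 0;
}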

My concern is how we can ever move this forward.  If we can't recreate
it on a test system, and you probably won't be allowed to push
experimental patches to the production system... what's left?

Also, if the kernel is introducing new scheduling bottlenecks, are we
just playing whack-a-mole by trying to deal with it in PG code?

Stepping back a bit, have you considered a connection pooler to
restrict the number of simultaneously active connections?  It wouldn't
be the first time that doing so has alleviated the symptoms of these
high-RAM kernel bottlenecks.  (If we are going to play whack-a-mole,
we might as well use a hammer that already exists and is well tested.)
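
With pgbouncer, for instance, the knobs would be roughly these (an
abridged pgbouncer.ini; the numbers are placeholders, not
recommendations):

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
pool_mode = transaction
; simultaneously active server connections per database/user pair
default_pool_size = 32
; extra clients queue here instead of piling onto the spinlock
max_client_conn = 1000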


Cheers,

Jeff

