On Fri, Aug 12, 2011 at 6:53 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> The worst case behaviour of the current freelist code is that it can
> take up to 5 * shared_buffers checks before identifying a victim
> buffer. That occurs when we have a working set exactly matching the
> size of shared buffers. There are problems when large portions of shared
> buffers are very popular, so that the free-ish buffers are a small
> proportion of the total buffers - this case gets worse if the
> distribution of buffer allocations is non-random so you get say 10
> free-ish blocks together, followed by N-10 very popular ones. That
> makes every 11th freelist request take ~O(shared_buffers). These
> problems will show themselves as sporadic BufFreelistLock contention.
> Sporadic is the problem here since it may make the contention hard to
> measure in aggregate, yet with real impact on performance.

I've been thinking about this exact same set of problems.

> For us to benefit from increased shared_buffers we need to have an
> algorithm that is O(k) in its worst case behaviour and O(1) in its
> best case behaviour - the latter aspect is provided by a correctly
> functioning bgwriter. At the moment, we do nothing to enforce O(k)
> behaviour.

Check.

> The following patch implements O(k) behaviour using a heuristic limit.
> My first attempt was a fixed heuristic limit, but that was less
> suitable because there are actually 2 requirements. The value of k
> must be small to have a measurable impact on the smoothness of
> StrategyGetBuffer(), but k must also be fairly large to ensure the
> choice of victim is a relatively good one. So the way to balance those
> conflicting objectives is to have a progressively increasing search
> window. And just to repeat, k is not a % of shared buffers, which would
> still give O(N) behaviour.

This is a very interesting idea, and I think it has a lot of potential.
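
Just so I'm sure I'm picturing the shape of it correctly, I'd imagine the
bounded sweep looking roughly like this (purely a sketch, not your patch;
ClockNext(), INITIAL_WINDOW, and MAX_WINDOW are made-up names):

/*
 * Sketch only: bound the victim search at a progressively growing window
 * instead of sweeping until a usage_count == 0 buffer turns up.  Total
 * work is O(k) with k ~ 2 * MAX_WINDOW, not O(shared_buffers).
 */
static volatile BufferDesc *
StrategyGetVictim(void)
{
    int         window = INITIAL_WINDOW;   /* small, keeps the common case cheap */

    for (;;)
    {
        volatile BufferDesc *best = NULL;
        int         i;

        for (i = 0; i < window; i++)
        {
            volatile BufferDesc *buf = ClockNext();  /* advance the clock hand */

            if (buf->usage_count == 0)
                return buf;             /* ideal victim, stop immediately */
            if (best == NULL || buf->usage_count < best->usage_count)
                best = buf;             /* remember the least-used so far */
            buf->usage_count--;         /* usual clock-sweep decrement */
        }

        /*
         * No free buffer inside the window: either settle for the least
         * used buffer we saw, or widen the window and look a bit further.
         */
        if (window >= MAX_WINDOW)
            return best;
        window *= 2;
    }
}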

One additional thought I had is that right now I think it's far too
easy for buffers to get pushed all the way up to usage count 5,
because we bump the usage count every time the buffer is pinned,
which can easily happen many times per clock sweep.  I think we would
be better off having a "used" flag on each buffer.  Each pin sets the
used bit, but does nothing if it is already set.  The usage count is
only changed by the clock sweep, which does the following on each
buffer:

- If the "used" flag is set, then the flag is cleared and the usage
count increases by one to a maximum of 5.
- If the "used" flag is not set, then the usage count decreases by one.

That way, buffers can't get hot because of a fast flurry of access
followed by nothing - to get up to usage_count 5, they've actually got
to stay interesting for a period of time.  As a side effect, I believe
we could get rid of the spinlock that we currently take and release
for every buffer the clock sweep hits, because we could regard the
usage_count as protected by the BufFreelistLock rather than by the
buffer header spinlock; and the "used" flag doesn't require perfect
synchronization.  We'd only pin the buffer we actually plan to evict
(and could recheck the "used" flag before doing so, in case a memory
ordering effect made us miss the fact that it had been recently set).
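
Roughly, the eviction path would then become something like this (again a
sketch, reusing the hypothetical BM_JUST_USED flag and the capped-sweep
function above):

for (;;)
{
    volatile BufferDesc *buf = StrategyGetVictim();   /* the capped sweep */

    LockBufHdr(buf);
    if ((buf->flags & BM_JUST_USED) || buf->refcount > 0)
    {
        /* Someone touched it after the sweep looked at it; go around again. */
        UnlockBufHdr(buf);
        continue;
    }
    PinBuffer_Locked(buf);      /* pins the buffer and drops the header lock */
    return buf;
}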

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

