On Mon, Apr 14, 2014 at 8:11 PM, Peter Geoghegan <p...@heroku.com> wrote:
> PostgreSQL implements a clock sweep algorithm, which gets us something
> approaching an LRU for the buffer manager in trade-off for less
> contention on core structures. Buffers have a usage_count/"popularity"
> that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
> algorithm only has one bit for what approximates our "usage_count" (so
> it's either 0 or 1). I think that at its core CLOCK is an algorithm
> that has some very desirable properties that I am sure must be
> preserved. Actually, I think it's more accurate to say we use a
> variant of clock pro, a refinement of the original CLOCK.
PostgreSQL's replacement algorithm is more similar to Generalized CLOCK,
or GCLOCK. CLOCK-Pro is a different algorithm that approximates LIRS.
LIRS is what MySQL implements, CLOCK-Pro is implemented by NetBSD, and
there has been some work on trying it on Linux. Both LIRS and CLOCK-Pro
work by keeping metadata entries for double the cache size so they can
detect pages that have been recently referenced. Basically, they provide
an adaptive trade-off between LRU and LFU.
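To make the comparison concrete, here is a toy sketch (not PostgreSQL's actual code) of the GCLOCK victim search described above: each buffer carries a usage_count that saturates at a maximum (5 in PostgreSQL, per BM_MAX_USAGE_COUNT), and the sweep decrements counts until it finds an unpinned buffer at zero.

```python
# Illustrative sketch of GCLOCK victim selection, not PostgreSQL's
# actual implementation. Each buffer has a usage_count that saturates
# at a maximum (5 in PostgreSQL, per BM_MAX_USAGE_COUNT); the sweep
# decrements counts and evicts the first unpinned buffer at zero.

BM_MAX_USAGE_COUNT = 5  # saturation point mentioned in the thread

class Buffer:
    def __init__(self, tag):
        self.tag = tag
        self.usage_count = 0
        self.pinned = False

def pin(buf):
    """On access, bump popularity up to the saturation limit."""
    buf.usage_count = min(buf.usage_count + 1, BM_MAX_USAGE_COUNT)

def clock_sweep(buffers, hand):
    """Advance the clock hand until a victim with usage_count == 0 is
    found, decrementing counts along the way. When every count starts
    at the maximum, this can take several full revolutions."""
    while True:
        buf = buffers[hand]
        hand = (hand + 1) % len(buffers)
        if buf.pinned:
            continue
        if buf.usage_count == 0:
            return buf, hand
        buf.usage_count -= 1
```

With a single usage bit instead of a saturating counter this degenerates to classic CLOCK, which is why the generalized form gives a rough popularity ordering at the cost of longer victim searches.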
> In the past, various hackers have noted problems they've observed with
> this scheme. A common pathology is to see frantic searching for a
> victim buffer only to find all buffer usage_count values at 5. It may
> take multiple revolutions of the clock hand before a victim buffer is
> found, as usage_count is decremented for each and every buffer. Also,
> BufFreelistLock contention is considered a serious bottleneck,
> which is related.
There's a paper on a non-blocking GCLOCK algorithm that does lock-free
clock sweep and buffer pinning. If we decide to stay with GCLOCK it may
be interesting, although I still believe that some variant of buffer
nailing is a better idea; my experience shows that most of the locking
overhead is cache line bouncing, ignoring the extreme cases where our
naive spinlock implementation blows up.
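For illustration, the core of such a non-blocking scheme is a compare-and-swap retry loop over a packed buffer state word. This is only a toy emulation (`AtomicState.compare_and_swap` stands in for a hardware CAS; a real implementation would pack refcount and usage_count into one atomic integer), but it shows the shape of lock-free pinning:

```python
# Toy sketch of the CAS retry loop a non-blocking GCLOCK pin could use.
# The "atomic" here is only emulated; a real implementation would pack
# refcount and usage_count into a single word and use hardware
# compare-and-swap, avoiding the per-buffer spinlock and the cache line
# bouncing it causes under contention.

class AtomicState:
    """Emulated atomic word holding (refcount, usage_count)."""
    def __init__(self):
        self.value = (0, 0)

    def compare_and_swap(self, expected, new):
        # Stand-in for a hardware CAS instruction.
        if self.value == expected:
            self.value = new
            return True
        return False

def lock_free_pin(state, max_usage=5):
    """Pin a buffer: atomically bump refcount and usage_count."""
    while True:  # standard lock-free retry loop
        refcount, usage = state.value  # atomic read
        new = (refcount + 1, min(usage + 1, max_usage))
        if state.compare_and_swap((refcount, usage), new):
            return new
```

Under contention a thread simply re-reads and retries, so no thread ever blocks holding a lock; the cost shifts to cache line traffic on the contended word, which is exactly the overhead noted above.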
> Let's leave aside inner/root pages though, because they're so
> dramatically useful when in a primary index on a TPC-B table that
> they'll always be cached by any non-terrible algorithm. It beggars
> belief that the still relatively dense (and consequently *popular*)
> B+Tree leaf pages get so little credit for being of such long-term
> utility (in the view of our clock sweep algorithm as implemented). The
> algorithm has what could be loosely described as an excessively
> short-term perspective. There is clearly a better balance to be had
> here. I don't think the answer is that we have the B-Tree code give
> its pages some special treatment that makes them harder to evict,
> although I will admit that I briefly entertained the idea.
There has been some research indicating that for TPC-A workloads,
giving index pages higher weights increases hit rates.
I think the hardest hurdle for any changes in this area will be
showing that we don't have any nasty regressions. I think the best way
to do that would be to study separately the performance overhead of
the replacement algorithm and optimality of the replacement choices.
If we capture a bunch of buffer reference traces by instrumenting
PinBuffer, we can pretty accurately simulate the behavior of different
algorithms and tuning choices at different shared buffer sizes.
Obviously full scale tests are still needed due to interactions with
OS, controller and disk caches and other miscellaneous influences. But
even so, simulation would get us much better coverage of various
workloads and at least some confidence that it's a good change
overall. It will be very hard and time consuming to gather equivalent
evidence with full scale tests.
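A minimal sketch of the trace-driven simulation proposed above: replay a captured reference trace (here just a list of page ids, as might be logged from an instrumented PinBuffer) through a cache model and report the hit rate at a given shared-buffers size. The `simulate_lru` function is a hypothetical baseline; swapping in GCLOCK or CLOCK-Pro models would let us compare replacement optimality offline.

```python
# Sketch of trace-driven cache simulation: replay a buffer reference
# trace through an LRU model (a simple baseline policy) and return the
# hit rate for a given cache size. Other replacement policies could be
# plugged in to compare their choices against the same trace.

from collections import OrderedDict

def simulate_lru(trace, cache_size):
    """Replay `trace` (a sequence of page ids) through an LRU cache of
    `cache_size` pages and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = True
    return hits / len(trace)
```

Running the same trace at a range of cache sizes produces a hit-rate curve per policy, which is exactly the kind of coverage that is impractical to get from full-scale tests alone.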
Cybertec Schönig & Schönig GmbH
A-2700 Wiener Neustadt