On 3/22/13 7:27 PM, Ants Aasma wrote:
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
Well, if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately.  If the non-locking test comes back good, only
then do you do a hardware TAS.
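
In code, that non-locking pre-test is the classic test-and-test-and-set
pattern.  A minimal sketch, using GCC atomic builtins; the
FakeBufferDesc layout is invented for illustration and is not the real
BufferDesc:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, stripped-down buffer header; not the real BufferDesc. */
typedef struct
{
    volatile uint32_t locked;      /* spinlock word */
    volatile int32_t  refcount;
    volatile int32_t  usage_count;
} FakeBufferDesc;

/*
 * Test-and-test-and-set: look at the buffer with plain loads first and
 * only fall through to the (cacheline-dirtying) atomic TAS when the
 * cheap test says the buffer might be a candidate.  A wrong answer
 * from the unlocked test just means we skip or retry a buffer.
 */
static bool
clock_sweep_try_buffer(FakeBufferDesc *buf)
{
    /* Non-locking pre-test: looks locked, pinned or recently used? */
    if (buf->locked || buf->refcount > 0 || buf->usage_count > 0)
        return false;              /* jump to the next buffer immediately */

    /* Only now do the hardware TAS. */
    if (__sync_lock_test_and_set(&buf->locked, 1) != 0)
        return false;              /* lost the race; move on */

    /* ... re-check refcount/usage_count under the lock, evict, ... */

    __sync_lock_release(&buf->locked);
    return true;
}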

You could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount.  So you only lock if/when it's time to select a candidate
buffer, and only then after a non-locking test first.  This would of
course require some amusing adjustments to various logical checks
(usage_count <= 0, heh).
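
The unlocked, advisory-only usage_count maintenance could be as simple
as the following sketch; USAGE_COUNT_CAP stands in for the real
BM_MAX_USAGE_COUNT, and nothing here is meant as actual buffer manager
code:

#include <stdint.h>

#define USAGE_COUNT_CAP 5          /* stand-in for BM_MAX_USAGE_COUNT */

/*
 * Purely advisory usage_count maintenance, no spinlock: plain
 * read-test-store.  A racy lost update just leaves a slightly stale
 * hint, which is acceptable because nothing correctness-critical
 * depends on the exact value.
 */
static inline void
usage_count_bump_unlocked(volatile int32_t *usage_count)
{
    int32_t cur = *usage_count;    /* plain load, no lock */

    if (cur < USAGE_COUNT_CAP)
        *usage_count = cur + 1;    /* plain store; races tolerated */
}

static inline void
usage_count_decay_unlocked(volatile int32_t *usage_count)
{
    int32_t cur = *usage_count;

    if (cur > 0)                   /* test before store: never below 0 */
        *usage_count = cur - 1;
}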

Moreover, if a buffer happens to miss a decrement due to a data race,
there's a good chance that the buffer is heavily used and wouldn't need
to be evicted soon anyway.  (If you arrange it as a
read-test-inc/dec-store operation, the count can never go out of
bounds.)  However, clock sweep and usage_count maintenance are not what
is causing contention, because that work is spread across many buffers.
The issue is pinning and unpinning: there we need an accurate count,
and some pages, like index roots, get hit very heavily.  In my opinion
the things to do there are, first, to convert to a futex-based spinlock
so that contention doesn't completely kill performance, and then to try
to get rid of the contention itself.  Converting to lock-free pinning
won't help much here, as what is killing us is the cacheline bouncing.
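
For what it's worth, the standard three-state futex lock recipe (along
the lines of Drepper's "Futexes Are Tricky") is quite small.  A Linux-
only sketch, not a drop-in replacement for s_lock, and glossing over
the spin-before-sleep phase and error handling a real spinlock
replacement would need:

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Lock states: 0 = free, 1 = held, 2 = held with possible waiters. */

static long
sys_futex(uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void
futex_lock_acquire(uint32_t *l)
{
    uint32_t c = 0;

    /* Fast path: take a free lock with a single CAS (0 -> 1). */
    if (__atomic_compare_exchange_n(l, &c, 1, false,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;

    /*
     * Slow path: mark the lock contended (state 2) and sleep in the
     * kernel.  The exchange both flags contention and tests whether
     * we managed to grab the lock.
     */
    if (c != 2)
        c = __atomic_exchange_n(l, 2, __ATOMIC_ACQUIRE);
    while (c != 0)
    {
        sys_futex(l, FUTEX_WAIT, 2);
        c = __atomic_exchange_n(l, 2, __ATOMIC_ACQUIRE);
    }
}

static void
futex_lock_release(uint32_t *l)
{
    /* Free the lock; wake one sleeper only if someone may be waiting. */
    if (__atomic_exchange_n(l, 0, __ATOMIC_RELEASE) == 2)
        sys_futex(l, FUTEX_WAKE, 1);
}

The point is that an uncontended acquire/release stays in userspace,
while a contended one sleeps in the kernel instead of burning the CPU
and the memory bus.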

One way to get rid of the contention is the buffer nailing idea that
Robert came up with.  If some buffer gets so hot that maintaining the
refcount in the buffer header leads to contention, promote that buffer
to nailed status, let every backend keep its pin counts locally, and
sometime later revisit the nailing decision, converting the pins back
into the buffer header if necessary.
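
The backend-local bookkeeping for that could be something like the
sketch below; the structures and function names are entirely
hypothetical and only illustrate "keep the pin counts locally while
the buffer is nailed":

#include <stdbool.h>
#include <stdint.h>

#define MAX_LOCAL_NAILED 64        /* arbitrary for this sketch */

/* Hypothetical per-backend bookkeeping for pins on nailed buffers. */
typedef struct
{
    int      buf_id;
    int32_t  local_pins;           /* pins held here, not yet reflected
                                    * in the shared buffer header */
} LocalNailedPin;

static LocalNailedPin local_nailed[MAX_LOCAL_NAILED];
static int n_local_nailed = 0;

/* Placeholders for this sketch, assumed to exist elsewhere. */
extern bool BufferIsNailed(int buf_id);
extern void PinBufferShared(int buf_id);   /* normal refcount path */

static void
PinBuffer_sketch(int buf_id)
{
    if (BufferIsNailed(buf_id))
    {
        /* Hot buffer: count the pin locally, touch no shared line. */
        for (int i = 0; i < n_local_nailed; i++)
        {
            if (local_nailed[i].buf_id == buf_id)
            {
                local_nailed[i].local_pins++;
                return;
            }
        }
        local_nailed[n_local_nailed].buf_id = buf_id;
        local_nailed[n_local_nailed].local_pins = 1;
        n_local_nailed++;
        return;
    }

    /* Cold buffer: fall back to the normal shared refcount. */
    PinBufferShared(buf_id);
}

The expensive part is un-nailing: every backend has to fold its
local_pins back into the shared header at some synchronization point
before the buffer can be evicted again.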

One other interesting idea I have seen is the closeable scalable
nonzero indication (C-SNZI) used in scalable rw-locks [1].  The idea
there is to use a tree structure to dynamically stripe access to the
shared lock counter when contention is detected.  The downside is that
a considerable amount of shared memory is needed, so there has to be
some way to limit the resource usage.  This is actually somewhat
isomorphic to the nailing idea.
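
To illustrate just the striping part (this is not the actual C-SNZI
algorithm from [1], which additionally maintains a tree so that the
zero/nonzero question is answered with one read at the root):

#include <stdint.h>
#include <stdbool.h>

#define PIN_STRIPES 16             /* arbitrary for this sketch */

/*
 * Simplified striped pin count: each backend updates a stripe chosen
 * by its backend id, so concurrent pins hit different cachelines.
 * The flat version below has to scan all stripes to answer "is it
 * pinned at all?", which is exactly the cost C-SNZI's tree avoids.
 */
typedef struct
{
    /* pad each stripe to its own cacheline to stop the bouncing */
    struct { volatile int32_t count; char pad[60]; } stripe[PIN_STRIPES];
} StripedPinCount;

static void
striped_pin(StripedPinCount *c, int backend_id)
{
    __atomic_fetch_add(&c->stripe[backend_id % PIN_STRIPES].count,
                       1, __ATOMIC_RELAXED);
}

static void
striped_unpin(StripedPinCount *c, int backend_id)
{
    __atomic_fetch_sub(&c->stripe[backend_id % PIN_STRIPES].count,
                       1, __ATOMIC_RELAXED);
}

static bool
striped_is_pinned(StripedPinCount *c)
{
    for (int i = 0; i < PIN_STRIPES; i++)
    {
        if (__atomic_load_n(&c->stripe[i].count, __ATOMIC_ACQUIRE) != 0)
            return true;
    }
    return false;
}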

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers.  I think the
improvement effort should concentrate on finding out what the problem
is there and figuring out how to fix it.  A simple idea to test would
be to partition shared buffers, along with the whole clock sweep
machinery, into smaller pools, the way the buffer mapping hash table
is already partitioned.  That should at the very least reduce
contention on the clock sweep, even if it doesn't reduce the work done
per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
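
In rough terms, the partitioned clock sweep suggested above might look
like the sketch below.  The partition count, layout and function name
are invented; only slock_t and SpinLockAcquire/SpinLockRelease are the
real spinlock primitives, and the partitions would need SpinLockInit
at startup:

#include "postgres.h"
#include "storage/spin.h"          /* slock_t, SpinLockAcquire/Release */

#define NUM_CLOCK_PARTITIONS 16    /* arbitrary for this sketch */

/* Each partition has its own clock hand and its own lock. */
typedef struct
{
    slock_t  lock;                 /* protects the clock hand */
    int      first_buffer;         /* first buffer id in this partition */
    int      num_buffers;          /* partition size */
    int      next_victim;          /* clock hand, relative offset */
} ClockSweepPartition;

static ClockSweepPartition clock_parts[NUM_CLOCK_PARTITIONS];

/*
 * Pick a victim from one partition only; different backends start in
 * different partitions, so the hands and their cachelines stay apart.
 */
static int
ClockSweepTick_sketch(int my_backend_id)
{
    int      part = my_backend_id % NUM_CLOCK_PARTITIONS;
    ClockSweepPartition *p = &clock_parts[part];
    int      victim;

    SpinLockAcquire(&p->lock);
    victim = p->first_buffer + p->next_victim;
    p->next_victim = (p->next_victim + 1) % p->num_buffers;
    SpinLockRelease(&p->lock);

    return victim;                 /* caller checks usage_count etc. */
}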

Partitioned clock sweep strikes me as a bad idea... you could certainly get 
unlucky and end up with a lot of hot stuff in one partition.

Another idea that's been brought up in the past is to have something in the 
background keep a minimum number of buffers on the free list. That's how the 
OS VM systems I'm familiar with work, so there's precedent for it.
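
Roughly, the background piece would just be a loop like the following
(hand-wavy sketch; all of the helpers except pg_usleep are made up):

/* Placeholders, not existing functions: */
extern int  FreeListLength(void);
extern int  RunClockSweepOnce(void);   /* evictable buffer id, or -1 */
extern void AddToFreeList(int buf_id);
extern void pg_usleep(long microsec);  /* the real backend sleep call */

/*
 * Keep at least min_free buffers on the freelist so that foreground
 * backends rarely have to run the clock sweep themselves.
 */
static void
BgFreeListMaintainer(int min_free)
{
    for (;;)
    {
        while (FreeListLength() < min_free)
        {
            int buf_id = RunClockSweepOnce();

            if (buf_id >= 0)
                AddToFreeList(buf_id);
        }

        /* Sleep briefly; waking on a latch when the list runs low
         * would be nicer than polling. */
        pg_usleep(10000L);             /* 10 ms */
    }
}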

I recall there were at least some theoretical concerns about this, but I don't 
remember if anyone actually tested the idea.

