While I was poking around at the index-only scans patch, I couldn't help noticing that our buffer replacement algorithm leaves something to be desired. Here's the query again:
select sum(aid) from sample_data a1 where exists
  (select * from pgbench_accounts a
    where a.aid = a1.aid and a.aid != 1234567);

As sample_data is not indexed, this sequentially scans that table and does an index-only scan of pgbench_accounts for each aid. When the query is done, here's what pg_buffercache has to say:

rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
 usagecount |  sum
------------+-------
          1 |   136
          2 |    12
          3 |    72
          4 |     7
          5 | 13755
            | 37218
(6 rows)

Only 96 of the 14286 buffers in sample_data are in shared_buffers, despite the fact that we have 37,218 *completely unused* buffers lying around. That sucks, because it means the sample query did a whole lot of buffer eviction that was completely needless. The ring buffer is there to prevent trashing the shared buffer arena, but here it would have been fine to trash the arena: there wasn't anything there we would have minded losing (or, indeed, anything at all).

On the other hand, the buffer manager has *no problem at all* trashing the buffer arena if we're faulting in pages for an index scan rather than a sequential scan. If you manage to get all of sample_data into memory (by running many copies of the above query in parallel, you can get each one to allocate its own ring buffer and eventually pull in all the pages), and then run some query that probes an index which is too large to fit in shared_buffers, it cheerfully blows the whole sample_data table out without a second thought. Had you sequentially scanned a big table, of course, there would be some protection, but an index scan can stomp all over everything with complete impunity.

The general problem here is that we are not very smart about handling workloads with weak locality - i.e. the working set is larger than shared_buffers. If the working set fits in shared_buffers, we will keep it there, and it will be strongly protected against eviction.
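For context, the policy in question is a clock sweep over per-buffer usage counts, capped at 5. Here's a minimal, self-contained simulation of that behavior - the class and buffer counts are illustrative, not the actual freelist.c code - which also shows how far the hand has to travel when every buffer is maximally hot:

```python
# Illustrative simulation of a clock-sweep replacement policy in the
# style of PostgreSQL's buffer manager.  Names and sizes are made up
# for the example; this is not the real freelist.c logic (pins,
# locking, and the free list are all omitted).

MAX_USAGE_COUNT = 5  # PostgreSQL caps usage_count at 5


class ClockSweep:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.hand = 0

    def touch(self, i):
        """On each access, bump the buffer's usage count up to the cap."""
        self.usage[i] = min(self.usage[i] + 1, MAX_USAGE_COUNT)

    def evict(self):
        """Advance the hand, decrementing usage counts, until a buffer
        with usage count 0 is found.  Returns (victim, steps taken)."""
        steps = 0
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % len(self.usage)
            steps += 1
            if self.usage[i] == 0:
                return i, steps
            self.usage[i] -= 1


# Worst case: every buffer is maximally "hot".
cs = ClockSweep(8)
for i in range(8):
    for _ in range(MAX_USAGE_COUNT):
        cs.touch(i)

victim, steps = cs.evict()
print(f"evicted buffer {victim} after {steps} sweep steps")
```

With 8 buffers all at usage count 5, the hand moves 8 * 5 + 1 = 41 steps before it can evict anything - a small-scale version of the everything-is-hot problem.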
But if the working set *doesn't* fit in shared_buffers, then we fall back on some not-very-smart heuristics to decide what to do: keep pages involved in index scans, ditch those involved in sequential scans.

It seems to me that we need a smarter algorithm. NetBSD has apparently adopted Clock-Pro, and there are some discussions of it being used in Linux as well, though I'm not sure whether that's actually (still?) the case. The paper is here, although I find its description of the algorithm to be approximately clear as mud:

http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf

One of the important notions that many of the algorithms in the literature hit on in one way or another is that you mustn't go and decide that all of your buffers - or nearly all of them - are hot. That's really equivalent to saying that none of your buffers are hot; and it's also pretty easy to see that such a scenario would be disastrous for our current algorithm, which - if everything in shared_buffers has usage count 5 all the time - will have to clock sweep a long way each time it needs to allocate a buffer. In fact, the more of your buffers want to be hot (and thus hard to evict), the fewer of them should actually be allowed to acquire that status.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers