On 3A: In practice, the popular modern OSes (BSD/Linux/Solaris/etc.)
implement dynamic I/O caching. Experiments have shown that the benefit
of re-using the PG buffer cache on large sequential scans is
vanishingly small when the buffer cache is small compared to system
memory. Since that is the normal and recommended configuration (the OS
I/O cache is auto-tuning, easy to administer, etc.), IMO the effort to
optimize buffer cache reuse for seq scans of tables > 1 x buffer cache
size is not worthwhile.
On 3B: The scenario described is "multiple readers seq scanning a large
table and sharing the bufcache", but in practice this is not a common
situation. The common situation is "multiple queries joining several
small tables to one or more large tables that are >> 1 x bufcache". In
that scenario, the dominant factor is the ability to keep the small
tables in the bufcache (or the OS I/O cache, for that matter) while
running the I/O-bound large table scans as fast as possible.
To that point - an important factor in achieving the maximum I/O rate
for large tables (> 1 x bufcache) is avoiding pollution of the CPU L2
cache. L2 caches are commonly in the range of 512KB - 2MB, which
matters mainly as an upper bound on the size of the ring buffer. The
effect has been demonstrated to be significant - in the 20%+ range.
Another thing to consider is the use of readahead inside the heapscan,
for which ring sizes >= 32KB are very effective.
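As a rough illustration of how the two ideas fit together (a sketch
only - the helper names and exact layout here are illustrative, not
our actual implementation):

    /* Sketch: scan a heap file through a small ring of buffers sized
     * to stay within L2 cache, issuing OS readahead one window ahead. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define BLCKSZ     8192
    #define RING_BYTES (32 * 1024)           /* >= 32KB, well under L2 */
    #define RING_SLOTS (RING_BYTES / BLCKSZ)

    static char ring[RING_SLOTS][BLCKSZ];

    static void
    scan_with_ring(int fd, long nblocks)
    {
        long blk;

        for (blk = 0; blk < nblocks; blk++)
        {
            /* hint the kernel to prefetch the next ring-sized window */
            if (blk % RING_SLOTS == 0)
                posix_fadvise(fd, (off_t) (blk + RING_SLOTS) * BLCKSZ,
                              RING_BYTES, POSIX_FADV_WILLNEED);

            /* recycle a ring slot instead of consuming a new buffer */
            if (pread(fd, ring[blk % RING_SLOTS], BLCKSZ,
                      (off_t) blk * BLCKSZ) != BLCKSZ)
                break;
            /* ... process the tuples on this page ... */
        }
    }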
We've implemented both ideas (ring buffer, readahead) and see very
significant improvements in I/O and query speeds on DSS workloads. I
would expect benefits on OLTP as well.
The modifications you suggest here may not have the following
properties:
- don't pollute the bufcache for seqscans of tables > 1 x bufcache
- for tables > 1 x bufcache, use a ring buffer for I/O that is ~32KB
  to minimize L2 cache pollution
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> Heikki Linnakangas
> Sent: Tuesday, May 08, 2007 3:40 AM
> To: PostgreSQL-development
> Cc: Jeff Davis; Simon Riggs
> Subject: [HACKERS] Seq scans roadmap
> Here's my roadmap for the "scan-resistant buffer cache" and
> "synchronized scans" patches.
> 1. Fix the current vacuum behavior of throwing dirty buffers
> to the freelist, forcing a lot of WAL flushes. Instead, use a
> backend-private ring of shared buffers that are recycled.
> This is what Simon's "scan-resistant buffer manager" did.
> The theory here is that if a page is read in by vacuum, it's
> unlikely to be accessed in the near future, therefore it
> should be recycled. If vacuum doesn't dirty the page, it's
> best to reuse the buffer immediately for the next page.
> However, if the buffer is dirty (and not just because we set
> hint bits), we ought to delay writing it to disk until the
> corresponding WAL record has been flushed to disk.
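For illustration, that rule might look like this (a sketch; the ring
struct and BufferIsDirty()/FlushRingBuffer() are assumed helpers, not
actual bufmgr API):

    /* Before recycling a slot, honor the WAL-before-data rule: a dirty
     * page must not reach disk before its WAL record does. */
    Buffer buf = ring->slots[ring->next % ring->size];

    if (BufferIsDirty(buf))              /* assumed helper */
    {
        XLogFlush(BufferGetLSN(buf));    /* wait for WAL to hit disk */
        FlushRingBuffer(buf);            /* assumed: write the page out */
    }
    /* the buffer is now clean and can be reused for the next page */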
> Simon's patch used a fixed size ring of buffers that are
> recycled, but I think the ring should be dynamically sized.
> Start with a small ring, and whenever you need to do a WAL
> flush to write a dirty buffer, increase the ring size. On
> every full iteration through the ring, decrease its size to
> trim down an unnecessarily large ring.
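That sizing policy could be as simple as this sketch (all names
assumed):

    /* Grow when recycling forced a WAL flush - the ring is cycling
     * faster than the WAL writer keeps up - and shrink a little on
     * each completed pass to trim an oversized ring. */
    if (forced_wal_flush)
        ring->size = Min(ring->size * 2, RING_MAX);
    else if (ring->next % ring->size == 0)
        ring->size = Max(ring->size - 1, RING_MIN);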
> This only alters the behavior of vacuums, and it's pretty safe to
> say it won't get worse than what we have now. In the future, we can
> use the buffer ring for seqscans as well; more on that in step 3.
> 2. Implement the list/table of last/ongoing seq scan
> positions. This is Jeff's "synchronized scans" patch. When a
> seq scan starts on a table larger than some threshold, it
> starts from where the previous seq scan is currently, or
> where it ended. This will synchronize the scans so that for
> two concurrent scans the total I/O is halved in the best
> case. There should be no other effect on performance.
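In outline, something like the following (a sketch - the shared hash
table and entry layout are my assumptions, not the patch's actual
code):

    /* Shared-memory table of current scan positions, keyed by the
     * relation's OID. */
    typedef struct
    {
        Oid         relid;      /* hash key */
        BlockNumber curblock;   /* last block reported by a scanner */
    } ScanPosEnt;

    BlockNumber
    ss_get_startblock(Oid relid)
    {
        bool        found;
        ScanPosEnt *ent = hash_search(ScanPosTable, &relid,
                                      HASH_FIND, &found);

        /* join an in-progress scan if there is one, else start at 0 */
        return found ? ent->curblock : 0;
    }

    void
    ss_report_position(Oid relid, BlockNumber blk)
    {
        bool        found;
        ScanPosEnt *ent = hash_search(ScanPosTable, &relid,
                                      HASH_ENTER, &found);

        ent->curblock = blk;    /* trailing scans will start near here */
    }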
> If you have a partitioned table, or a union of multiple tables, or
> any other plan where multiple seq scans are performed in arbitrary
> order, this change won't alter the order in which the partitions are
> scanned, and therefore won't ensure they are synchronized.
> Now that we have both pieces of the puzzle in place, it's
> time to consider what more we can do with them:
> 3A. To take advantage of the "cache trail" of a previous seq scan,
> scan backwards from where the previous seq scan ended, until you hit
> a buffer that's not in cache.
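That is, roughly (a sketch; PageIsCached() - a residency probe that
doesn't fault the page in - is an assumed helper):

    /* Walk backwards through the still-cached trail of the previous
     * scan; once a page isn't resident, the remaining blocks are read
     * with a normal forward scan (wrapping around at the end). */
    BlockNumber blk = prev_end_block;

    while (blk > 0 && PageIsCached(rel, blk - 1))   /* assumed probe */
        process_page(rel, --blk);                   /* cache hits */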
> This will allow taking advantage of the buffer cache even if the
> table doesn't fit completely in RAM. That can make a big difference
> if the table size is just slightly bigger than RAM, and can avoid
> the nasty surprise when a table grows beyond RAM size and queries
> start taking minutes instead of seconds.
> This should be a non-controversial change on its own from a
> performance point of view. No query should get slower, and some will
> become faster. But see step 3B:
> 3B. Currently, a sequential scan of a large table spoils the buffer
> cache by evicting other pages. In CVS HEAD, as soon as the table is
> larger than shared_buffers, the pages in the buffer cache won't be
> used to speed up running the same query again, and there's no reason
> to believe the pages read in would be more useful than any other
> page in the database - in particular the pages that were in the
> buffer cache before the huge seq scan. If the table being scanned is
> larger than 5 * shared_buffers, the scan will evict every other page
> from the cache if there's no other activity in the database (max
> usage_count is 5).
> If the table is much larger than shared_buffers, say 10 times as
> large, then even with change 3A to read the in-cache pages first,
> using all of shared_buffers to cache the table will only speed up
> the query by 10%: at best a tenth of the pages come from cache, so
> total I/O drops by roughly a tenth. We should not spoil the cache
> for such a small gain, and should use the local buffer ring strategy
> instead. It's better to make queries that are slow anyway a little
> slower than to slow down queries that are currently fast.
> As you may notice, 3A and 3B are at odds with each other. We can
> implement both, but you can't use both strategies in the same scan.
> Therefore we need to have decision logic of some kind to figure out
> which strategy is optimal.
> A simple heuristic is to decide based on the table size:
>   < 0.1 * shared_buffers -> start from page 0, keep in cache (like we do now)
>   < 5 * shared_buffers   -> start from last read page, keep in cache
>   > 5 * shared_buffers   -> start from last read page, use buffer ring
> I'm not sure about the constants; we might need to make them GUC
> variables, as Simon argued, but that would be the general approach.
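In code, the heuristic would amount to something like this (a sketch;
the thresholds are the constants above, other names assumed):

    /* Pick the start position and buffering strategy from the table
     * size in pages, measured against shared_buffers (NBuffers). */
    if (nblocks < NBuffers / 10)
    {
        start_block = 0;            /* small table: cache it whole */
        use_ring = false;
    }
    else if (nblocks < 5 * NBuffers)
    {
        start_block = ss_get_startblock(relid);  /* sync with others */
        use_ring = false;           /* still worth keeping in cache */
    }
    else
    {
        start_block = ss_get_startblock(relid);
        use_ring = true;            /* protect the buffer cache */
    }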
> In the future, I'm envisioning a smarter algorithm to size the local
> buffer ring. Take all buffers with usage_count=0, plus a few with
> usage_count=1 from the clock sweep. That way, if there are a lot of
> seldom-used buffers in the buffer cache, we'll use more buffers to
> cache the large scan, and vice versa. And no matter how large the
> scan, it wouldn't blow all buffers from the cache. But if you execute
> the same query again, the buffers left in the cache from the last
> scan were apparently useful, so we use a bigger ring this time.
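One possible shape for that sizing pass (a sketch over the buffer
descriptors; MAX_ONES and the direct field access are assumptions):

    /* Count the buffers the clock sweep considers cold: everything at
     * usage_count 0, plus a limited number at usage_count 1. */
    int ring_size = 0, ones = 0, i;

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bd = &BufferDescriptors[i];

        if (bd->usage_count == 0)
            ring_size++;
        else if (bd->usage_count == 1 && ones++ < MAX_ONES)
            ring_size++;
    }
    /* a seldom-used cache yields a big ring; a busy one, a small ring */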
> I'm going to do this incrementally, and we'll see how far we get for
> 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
> Simon's patch (step 1), run some performance tests with vacuum, and
> submit a patch for that. Then I'll move to Jeff's patch (step 2).
> Thoughts? Everyone happy with the roadmap?
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com