On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
implement dynamic I/O caching.  The experiments have shown that benefit
of re-using PG buffer cache on large sequential scans is vanishingly
small when the buffer cache size is small compared to the system memory.
Since this is a normal and recommended situation (OS I/O cache is
auto-tuning and easy to administer, etc), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not

On 3B: The scenario described is "multiple readers seq scanning large
table and sharing bufcache", but in practice this is not a common
situation.  The common situation is "multiple queries joining several
small tables to one or more large tables that are >> 1 x bufcache".  In
the common scenario, the dominant factor is the ability to keep the
small tables in bufcache (or I/O cache for that matter) while running
the I/O bound large table scans as fast as possible.

To that point - an important factor in achieving max I/O rate for large
tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
This is commonly in the range of 512KB -> 2MB, which is only important
when considering a bound on the size of the ring buffer.  The effect has
been demonstrated to be significant - in the 20%+ range.  Another thing
to consider is the use of readahead inside the heapscan, in which case
sizes >= 32KB are very effective.

We've implemented both ideas (ring buffer, readahead) and see very
significant improvements in I/O and query speeds on DSS workloads.  I
would expect benefits on OLTP as well.

The modifications you suggest here may not have the following
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
minimize L2 cache pollution

- Luke

> -----Original Message-----
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Heikki Linnakangas
> Sent: Tuesday, May 08, 2007 3:40 AM
> To: PostgreSQL-development
> Cc: Jeff Davis; Simon Riggs
> Subject: [HACKERS] Seq scans roadmap
> Here's my roadmap for the "scan-resistant buffer cache" and 
> "synchronized scans" patches.
> 1. Fix the current vacuum behavior of throwing dirty buffers 
> to the freelist, forcing a lot of WAL flushes. Instead, use a 
> backend-private ring of shared buffers that are recycled. 
> This is what Simon's "scan-resistant buffer manager" did.
> The theory here is that if a page is read in by vacuum, it's 
> unlikely to be accessed in the near future, therefore it 
> should be recycled. If vacuum doesn't dirty the page, it's 
> best to reuse the buffer immediately for the next page. 
> However, if the buffer is dirty (and not just because we set 
> hint bits), we ought to delay writing it to disk until the 
> corresponding WAL record has been flushed to disk.
> Simon's patch used a fixed size ring of buffers that are 
> recycled, but I think the ring should be dynamically sized. 
> Start with a small ring, and whenever you need to do a WAL 
> flush to write a dirty buffer, increase the ring size. On 
> every full iteration through the ring, decrease its size to 
> trim down an unnecessarily large ring.
> This only alters the behavior of vacuums, and it's pretty 
> safe to say it won't get worse than what we have now. In the 
> future, we can use the buffer ring for seqscans as well; more 
> on that on step 3.
> 2. Implement the list/table of last/ongoing seq scan 
> positions. This is Jeff's "synchronized scans" patch. When a 
> seq scan starts on a table larger than some threshold, it 
> starts from where the previous seq scan is currently, or 
> where it ended. This will synchronize the scans so that for 
> two concurrent scans the total I/O is halved in the best 
> case. There should be no other effect on performance.
> If you have a partitioned table, or union of multiple tables 
> or any other plan where multiple seq scans are performed in 
> arbitrary order, this change won't change the order the 
> partitions are scanned and won't therefore ensure they will 
> be synchronized.
> Now that we have both pieces of the puzzle in place, it's 
> time to consider what more we can do with them:
> 3A. To take advantage of the "cache trail" of a previous seq 
> scan, scan 
> backwards from where the previous seq scan ended, until a you hit a 
> buffer that's not in cache.
> This will allow taking advantage of the buffer cache even if 
> the table 
> doesn't fit completely in RAM. That can make a big difference if the 
> table size is just slightly bigger than RAM, and can avoid the nasty 
> surprise when a table grows beyond RAM size and queries start taking 
> minutes instead of seconds.
> This should be a non-controversial change on its own from performance 
> point of view. No query should get slower, and some will 
> become faster. 
> But see step 3B:
> 3B. Currently, sequential scans on a large table spoils the 
> buffer cache 
> by evicting other pages from the cache. In CVS HEAD, as soon as the 
> table is larger than shared_buffers, the pages in the buffer won't be 
> used to speed up running the same query again, and there's no 
> reason to 
> believe the pages read in would be more useful than any other page in 
> the database, and in particular the pages that were in the 
> buffer cache 
> before the huge seq scan. If the table being scanned is > 5 * 
> shared_buffers, the scan will evict every other page from the 
> cache if 
> there's no other activity in the database (max usage_count is 5).
> If the table is much larger than shared_buffers, say 10 times 
> as large, 
> even with the change 3B to read the pages that are in cache 
> first, using 
> all shared_buffers to cache the table will only speed up the query by 
> 10%. We should not spoil the cache for such a small gain, and use the 
> local buffer ring strategy instead. It's better to make 
> queries that are 
> slow anyway a little bit slower, than making queries that are 
> normally 
> really fast, slow.
> As you may notice, 3A and 3B are at odds with each other. We can 
> implement both, but you can't use both strategies in the same scan. 
> Therefore we need to have decision logic of some kind to figure out 
> which strategy is optimal.
> A simple heuristic is to decide based on the table size:
> < 0.1*shared_buffers  -> start from page 0, keep in cache 
> (like we do now)
> < 5 * shared_buffers  -> start from last read page, keep in cache
>  > 5 * shared_buffers -> start from last read page, use buffer ring
> I'm not sure about the constants, we might need to make them GUC 
> variables as Simon argued, but that would be the general approach.
> In the future, I'm envisioning a smarter algorithm to size the local 
> buffer ring. Take all buffers with usage_count=0, plus a few with 
> usage_count=1 from the clock sweep. That way if there's a lot 
> of buffers 
> in the buffer cache that are seldomly used, we'll use more buffers to 
> cache the large scan, and vice versa. And no matter how large 
> the scan, 
> it wouldn't blow all buffers from the cache. But if you 
> execute the same 
> query again, the buffers left in the cache from the last scan were 
> apparently useful, so we use a bigger ring this time.
> I'm going to do this incrementally, and we'll see how far we get for 
> 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up 
> Simon's patch (step 1), run some performance tests with vacuum, and 
> submit a patch for that. Then I'll move to Jeff's patch (step 2).
> Thoughts? Everyone happy with the roadmap?
> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
> ---------------------------(end of 
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>                http://archives.postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Reply via email to