Heikki,

On 3A: In practice, popular modern OSes (BSD, Linux, Solaris, etc.)
implement dynamic I/O caching. Experiments have shown that the benefit
of re-using the PG buffer cache on large sequential scans is vanishingly
small when the buffer cache is small compared to system memory. Since
this is a normal and recommended configuration (the OS I/O cache is
auto-tuning, easy to administer, etc.), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not
worthwhile.
On 3B: The scenario described is "multiple readers seq scanning a large
table and sharing the bufcache", but in practice this is not a common
situation. The common situation is "multiple queries joining several
small tables to one or more large tables that are >> 1 x bufcache". In
the common scenario, the dominant factor is the ability to keep the
small tables in the bufcache (or the I/O cache, for that matter) while
running the I/O-bound large table scans as fast as possible.

To that point: an important factor in achieving the maximum I/O rate
for large tables (> 1 x bufcache) is avoiding pollution of the CPU L2
cache. L2 caches are commonly in the 512KB to 2MB range, which matters
mainly as an upper bound on the size of the ring buffer. The effect has
been demonstrated to be significant - in the 20%+ range.

Another thing to consider is the use of readahead inside the heapscan,
for which sizes >= 32KB are very effective.

We've implemented both ideas (ring buffer, readahead) and see very
significant improvements in I/O and query speeds on DSS workloads. I
would expect benefits on OLTP as well.

The modifications you suggest here may not have the following
properties:
- don't pollute the bufcache for seqscans of tables > 1 x bufcache
- for tables > 1 x bufcache, use a ring buffer for I/O that is ~32KB to
  minimize L2 cache pollution

- Luke

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Heikki Linnakangas
> Sent: Tuesday, May 08, 2007 3:40 AM
> To: PostgreSQL-development
> Cc: Jeff Davis; Simon Riggs
> Subject: [HACKERS] Seq scans roadmap
>
> Here's my roadmap for the "scan-resistant buffer cache" and
> "synchronized scans" patches.
>
> 1. Fix the current vacuum behavior of throwing dirty buffers to the
> freelist, forcing a lot of WAL flushes. Instead, use a
> backend-private ring of shared buffers that are recycled. This is
> what Simon's "scan-resistant buffer manager" did.
>
> The theory here is that if a page is read in by vacuum, it's unlikely
> to be accessed in the near future, so it should be recycled. If
> vacuum doesn't dirty the page, it's best to reuse the buffer
> immediately for the next page. However, if the buffer is dirty (and
> not just because we set hint bits), we ought to delay writing it to
> disk until the corresponding WAL record has been flushed to disk.
>
> Simon's patch used a fixed-size ring of buffers that are recycled,
> but I think the ring should be dynamically sized. Start with a small
> ring, and whenever you need to do a WAL flush to write a dirty
> buffer, increase the ring size. On every full iteration through the
> ring, decrease its size to trim down an unnecessarily large ring.
>
> This only alters the behavior of vacuum, and it's pretty safe to say
> it won't get worse than what we have now. In the future, we can use
> the buffer ring for seqscans as well; more on that in step 3.
>
> 2. Implement the list/table of last/ongoing seq scan positions. This
> is Jeff's "synchronized scans" patch. When a seq scan starts on a
> table larger than some threshold, it starts from where the previous
> seq scan is currently, or where it ended. This synchronizes the scans
> so that for two concurrent scans the total I/O is halved in the best
> case. There should be no other effect on performance.
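For what it's worth, here is a minimal C sketch of what the shared
scan-position table in step 2 could look like. All the names
(ScanLocEntry, ss_report_location, ss_get_startblock) and the slot and
threshold constants are made up for illustration - they are not the
actual patch's API - and a real version would live in shared memory
under a lock rather than in a plain static array:

/*
 * Hypothetical sketch of a shared "last scan position" table.
 * A real implementation would be in shared memory, locked.
 */
#include <stdio.h>

typedef unsigned int Oid;
typedef unsigned int BlockNumber;

#define SCANLOC_SLOTS 8          /* tiny fixed table; slots reused round-robin */
#define SYNC_SCAN_THRESHOLD 1024 /* only sync scans of tables larger than this */

typedef struct ScanLocEntry {
    Oid         relid;   /* table being scanned; 0 = unused slot */
    BlockNumber blockno; /* last block reported by any scan of it */
} ScanLocEntry;

static ScanLocEntry scanlocs[SCANLOC_SLOTS];
static int next_victim = 0;

/* Record that some scan of 'relid' just read 'blockno'. */
static void ss_report_location(Oid relid, BlockNumber blockno)
{
    int i;
    for (i = 0; i < SCANLOC_SLOTS; i++) {
        if (scanlocs[i].relid == relid) {
            scanlocs[i].blockno = blockno;
            return;
        }
    }
    /* not found: overwrite the next slot round-robin */
    scanlocs[next_victim].relid = relid;
    scanlocs[next_victim].blockno = blockno;
    next_victim = (next_victim + 1) % SCANLOC_SLOTS;
}

/* Where should a new seq scan of 'relid' (nblocks long) start? */
static BlockNumber ss_get_startblock(Oid relid, BlockNumber nblocks)
{
    int i;
    if (nblocks <= SYNC_SCAN_THRESHOLD)
        return 0;                        /* small table: start at page 0 */
    for (i = 0; i < SCANLOC_SLOTS; i++)
        if (scanlocs[i].relid == relid)
            return scanlocs[i].blockno;  /* join the ongoing/last scan */
    return 0;
}

int main(void)
{
    ss_report_location(16384, 50000);    /* a scan of table 16384 is at block 50000 */
    printf("new scan starts at block %u\n", ss_get_startblock(16384, 200000));
    return 0;
}

A second scan starting while the first is still running simply picks up
at the most recently reported block, which is where the halved-I/O best
case comes from.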
> If you have a partitioned table, or a union of multiple tables, or
> any other plan where multiple seq scans are performed in arbitrary
> order, this change won't change the order the partitions are scanned
> in, and therefore won't ensure they are synchronized.
>
> Now that we have both pieces of the puzzle in place, it's time to
> consider what more we can do with them:
>
> 3A. To take advantage of the "cache trail" of a previous seq scan,
> scan backwards from where the previous seq scan ended, until you hit
> a buffer that's not in cache.
>
> This will allow taking advantage of the buffer cache even if the
> table doesn't fit completely in RAM. That can make a big difference
> if the table size is just slightly bigger than RAM, and it can avoid
> the nasty surprise when a table grows beyond RAM size and queries
> start taking minutes instead of seconds.
>
> This should be a non-controversial change on its own from a
> performance point of view. No query should get slower, and some will
> become faster. But see step 3B:
>
> 3B. Currently, a sequential scan of a large table spoils the buffer
> cache by evicting other pages from it. In CVS HEAD, as soon as the
> table is larger than shared_buffers, the pages in the buffer won't be
> used to speed up running the same query again, and there's no reason
> to believe the pages read in would be more useful than any other page
> in the database, in particular the pages that were in the buffer
> cache before the huge seq scan. If the table being scanned is > 5 *
> shared_buffers, the scan will evict every other page from the cache
> if there's no other activity in the database (max usage_count is 5).
>
> If the table is much larger than shared_buffers, say 10 times as
> large, then even with change 3A to read the already-cached pages
> first, using all of shared_buffers to cache the table will only speed
> up the query by 10%. We should not spoil the cache for such a small
> gain; we should use the local buffer ring strategy instead. It's
> better to make queries that are slow anyway a little bit slower than
> to make queries that are normally really fast, slow.
>
> As you may notice, 3A and 3B are at odds with each other. We can
> implement both, but we can't use both strategies in the same scan.
> Therefore we need decision logic of some kind to figure out which
> strategy is optimal.
>
> A simple heuristic is to decide based on the table size:
>
> < 0.1 * shared_buffers -> start from page 0, keep in cache (like now)
> < 5 * shared_buffers   -> start from last read page, keep in cache
> > 5 * shared_buffers   -> start from last read page, use buffer ring
>
> I'm not sure about the constants; we might need to make them GUC
> variables, as Simon argued, but that would be the general approach.
>
> In the future, I'm envisioning a smarter algorithm to size the local
> buffer ring. Take all buffers with usage_count=0, plus a few with
> usage_count=1 from the clock sweep. That way, if there are a lot of
> buffers in the buffer cache that are seldom used, we'll use more
> buffers to cache the large scan, and vice versa. And no matter how
> large the scan, it wouldn't blow all buffers from the cache. But if
> you execute the same query again, the buffers left in the cache from
> the last scan were apparently useful, so we use a bigger ring this
> time.
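To make the proposed decision logic concrete, here's a small C sketch
of the size-based heuristic above, combined with the synchronized start
position. The struct, the function name, and the hard-coded thresholds
are hypothetical placeholders (the thresholds would presumably be the
GUCs mentioned), not actual patch code:

/*
 * Hypothetical sketch of the size-based scan-strategy heuristic.
 * Sizes are in blocks (pages), as PostgreSQL measures them.
 */
#include <stdio.h>

typedef unsigned int BlockNumber;

typedef struct ScanStrategy {
    BlockNumber start_block;   /* where the scan begins */
    int         use_ring;      /* 1 = recycle a small ring of buffers */
} ScanStrategy;

static ScanStrategy choose_scan_strategy(BlockNumber table_blocks,
                                         BlockNumber shared_buffers,
                                         BlockNumber last_scan_pos)
{
    ScanStrategy s;
    if (table_blocks < shared_buffers / 10) {
        s.start_block = 0;             /* tiny table: cache whole, as now */
        s.use_ring = 0;
    } else if (table_blocks < 5 * shared_buffers) {
        s.start_block = last_scan_pos; /* follow the cache trail (3A) */
        s.use_ring = 0;
    } else {
        s.start_block = last_scan_pos; /* too big to cache: ring buffer (3B) */
        s.use_ring = 1;
    }
    return s;
}

int main(void)
{
    /* e.g. shared_buffers = 32768 blocks (256MB), table = 500000 blocks */
    ScanStrategy s = choose_scan_strategy(500000, 32768, 123456);
    printf("start at block %u, ring buffer: %s\n",
           s.start_block, s.use_ring ? "yes" : "no");
    return 0;
}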
> I'm going to do this incrementally, and we'll see how far we get for
> 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
> Simon's patch (step 1), run some performance tests with vacuum, and
> submit a patch for that. Then I'll move to Jeff's patch (step 2).
>
> Thoughts? Everyone happy with the roadmap?
>
> --
> Heikki Linnakangas
> EnterpriseDB   http://www.enterprisedb.com