[HACKERS] Seq scans roadmap

Heikki Linnakangas Tue, 08 May 2007 03:43:30 -0700

Here's my roadmap for the "scan-resistant buffer cache" and "synchronized scans" patches.

1. Fix the current vacuum behavior of throwing dirty buffers to the freelist, forcing a lot of WAL flushes. Instead, use a backend-private ring of shared buffers that are recycled. This is what Simon's "scan-resistant buffer manager" did.

The theory here is that if a page is read in by vacuum, it's unlikely to be accessed in the near future, therefore it should be recycled. If vacuum doesn't dirty the page, it's best to reuse the buffer immediately for the next page. However, if the buffer is dirty (and not just because we set hint bits), we ought to delay writing it to disk until the corresponding WAL record has been flushed to disk.

Simon's patch used a fixed size ring of buffers that are recycled, but I think the ring should be dynamically sized. Start with a small ring, and whenever you need to do a WAL flush to write a dirty buffer, increase the ring size. On every full iteration through the ring, decrease its size to trim down an unnecessarily large ring.
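
To make the resizing rule concrete, here is a minimal sketch in Python. It is a toy model, not the patch: the class name, the doubling on growth, the shrink-by-one step, and the min/max bounds are all assumptions chosen for illustration; the text above only says "increase" and "decrease".

```python
class BufferRing:
    """Toy model of a backend-private buffer ring with an adaptive size."""

    def __init__(self, min_size=8, max_size=256):
        # min_size and max_size are hypothetical bounds, not from the proposal.
        self.min_size = min_size
        self.max_size = max_size
        self.size = min_size
        self.pos = 0  # index of the next slot to recycle

    def recycle_next(self, needs_wal_flush):
        """Move on to the next slot and return the ring size afterwards.

        needs_wal_flush is True when reusing the slot would force a WAL
        flush, i.e. the buffer is dirty and its WAL record is not on disk.
        """
        if needs_wal_flush:
            # The buffer came around too soon: grow the ring (assumed: double).
            self.size = min(self.size * 2, self.max_size)
        self.pos += 1
        if self.pos >= self.size:
            # Full iteration completed: trim the ring (assumed: by one slot).
            self.pos = 0
            self.size = max(self.size - 1, self.min_size)
        return self.size
```

With these assumed step sizes the ring grows quickly under WAL-flush pressure and drifts back down slowly once the pressure is gone.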

This only alters the behavior of vacuums, and it's pretty safe to say it won't get worse than what we have now. In the future, we can use the buffer ring for seqscans as well; more on that in step 3.

2. Implement the list/table of last/ongoing seq scan positions. This is Jeff's "synchronized scans" patch. When a seq scan starts on a table larger than some threshold, it starts from where the previous seq scan is currently, or where it ended. This will synchronize the scans so that for two concurrent scans the total I/O is halved in the best case. There should be no other effect on performance.
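
A rough sketch of the idea in Python, assuming a plain dictionary stands in for the shared list/table of scan positions. The function names, the dictionary, and the wraparound back to the start block are illustrative assumptions, not the interface of Jeff's patch.

```python
def report_position(positions, table, block):
    """Record the block a running (or just finished) seq scan is at."""
    positions[table] = block


def scan_order(positions, table, nblocks, threshold):
    """Return the order in which a new seq scan visits the table's blocks.

    Tables at or below the threshold are scanned from block 0 as before.
    Larger tables start at the last reported position and wrap around,
    so every block is still visited exactly once.
    """
    if nblocks <= threshold:
        start = 0
    else:
        start = positions.get(table, 0) % nblocks
    return [(start + i) % nblocks for i in range(nblocks)]


positions = {}  # hypothetical stand-in for the shared position table
report_position(positions, "big_table", 6)
order = scan_order(positions, "big_table", nblocks=10, threshold=4)
```

Here `order` begins at block 6, where the earlier scan was last seen, and wraps around to finish at block 5.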

If you have a partitioned table, or union of multiple tables or any other plan where multiple seq scans are performed in arbitrary order, this change won't alter the order in which the partitions are scanned and therefore won't ensure that the scans are synchronized.

Now that we have both pieces of the puzzle in place, it's time to consider what more we can do with them:

3A. To take advantage of the "cache trail" of a previous seq scan, scan backwards from where the previous seq scan ended, until you hit a buffer that's not in cache.

This will allow taking advantage of the buffer cache even if the table doesn't fit completely in RAM. That can make a big difference if the table size is just slightly bigger than RAM, and can avoid the nasty surprise when a table grows beyond RAM size and queries start taking minutes instead of seconds.
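
The backwards walk in 3A could look roughly like the Python sketch below. The function name and the `in_cache` callback are hypothetical, and wrapping around from block 0 to the last block is an assumption made because synchronized scans may start mid-table. How the remaining uncached blocks are then read is left open here.

```python
def cached_trail(end_block, nblocks, in_cache):
    """Return the blocks to read first, newest to oldest.

    Walks backwards from end_block (with assumed wraparound) and stops at
    the first block that in_cache reports as missing from the buffer cache.
    """
    trail = []
    block = end_block
    for _ in range(nblocks):  # never visit a block twice
        if not in_cache(block):
            break
        trail.append(block)
        block = (block - 1) % nblocks
    return trail


cache = {7, 8, 9}  # blocks left behind by the previous scan
trail = cached_trail(end_block=9, nblocks=10, in_cache=cache.__contains__)
```

In this example the previous scan ended at block 9 and left blocks 7 to 9 in cache, so the trail is 9, 8, 7 and the walk stops at block 6.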

This should be a non-controversial change on its own from a performance point of view. No query should get slower, and some will become faster. But see step 3B:

3B. Currently, sequential scans on a large table spoil the buffer cache by evicting other pages from the cache. In CVS HEAD, as soon as the table is larger than shared_buffers, the pages in the buffer won't be used to speed up running the same query again, and there's no reason to believe the pages read in would be more useful than any other page in the database, and in particular the pages that were in the buffer cache before the huge seq scan. If the table being scanned is > 5 * shared_buffers, the scan will evict every other page from the cache if there's no other activity in the database (max usage_count is 5).

If the table is much larger than shared_buffers, say 10 times as large, even with the change 3A to read the pages that are in cache first, using all shared_buffers to cache the table will only speed up the query by 10%. We should not spoil the cache for such a small gain, and use the local buffer ring strategy instead. It's better to make queries that are slow anyway a little bit slower, than making queries that are normally really fast, slow.
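
The 10% figure follows from simple arithmetic: at most shared_buffers worth of the table's pages can come from cache, so the best-case share of reads avoided is shared_buffers divided by the table size. A small Python sketch, with a hypothetical function name and page counts picked only for illustration:

```python
def max_cached_fraction(table_pages, shared_buffers_pages):
    """Upper bound on the share of a table's pages served from shared_buffers."""
    return min(1.0, shared_buffers_pages / table_pages)


# A table 10 times as large as shared_buffers: at most a tenth is cached.
fraction = max_cached_fraction(table_pages=10000, shared_buffers_pages=1000)
```

This is only an upper bound; it assumes the whole of shared_buffers holds pages of this one table.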

As you may notice, 3A and 3B are at odds with each other. We can implement both, but you can't use both strategies in the same scan. Therefore we need to have decision logic of some kind to figure out which strategy is optimal.


A simple heuristic is to decide based on the table size:

< 0.1*shared_buffers -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers -> start from last read page, keep in cache
> 5 * shared_buffers -> start from last read page, use buffer ring

I'm not sure about the constants; we might need to make them GUC variables as Simon argued, but that would be the general approach.
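
The size-based heuristic can be written down as a small Python sketch. The function name and the parameter names are hypothetical; the default factors 0.1 and 5 are the tentative constants from the table above. The table leaves the exact boundaries undefined, so sending a table of exactly 5 * shared_buffers to the buffer ring is an assumption.

```python
def choose_strategy(table_pages, shared_buffers_pages,
                    small_factor=0.1, large_factor=5.0):
    """Return (start position, buffer strategy) for a seq scan.

    small_factor and large_factor are the tentative constants; they might
    become configuration settings rather than fixed values.
    """
    if table_pages < small_factor * shared_buffers_pages:
        return ("page 0", "keep in cache")
    if table_pages < large_factor * shared_buffers_pages:
        return ("last read page", "keep in cache")
    # Boundary case (exactly large_factor times) lands here by assumption.
    return ("last read page", "buffer ring")


small = choose_strategy(table_pages=50, shared_buffers_pages=1000)
medium = choose_strategy(table_pages=2000, shared_buffers_pages=1000)
huge = choose_strategy(table_pages=6000, shared_buffers_pages=1000)
```

With shared_buffers at 1000 pages, the 50-page table starts at page 0, the 2000-page table starts from the last read page and stays in cache, and the 6000-page table uses the buffer ring.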

In the future, I'm envisioning a smarter algorithm to size the local buffer ring. Take all buffers with usage_count=0, plus a few with usage_count=1 from the clock sweep. That way if there are a lot of buffers in the buffer cache that are seldom used, we'll use more buffers to cache the large scan, and vice versa. And no matter how large the scan, it wouldn't blow all buffers from the cache. But if you execute the same query again, the buffers left in the cache from the last scan were apparently useful, so we use a bigger ring this time.
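
As a very rough illustration of that sizing rule, the Python sketch below counts candidate buffers from a list of usage counts. The function name and the cap on usage_count=1 buffers are assumptions ("a few" is not quantified above), and a real clock sweep would pick buffers incrementally rather than from a snapshot.

```python
def ring_size(usage_counts, extra_from_count_one=4):
    """Size the local ring from a snapshot of buffer usage counts.

    Takes every buffer with usage_count == 0, plus at most
    extra_from_count_one buffers with usage_count == 1 (assumed cap).
    """
    zero = sum(1 for c in usage_counts if c == 0)
    one = sum(1 for c in usage_counts if c == 1)
    return zero + min(one, extra_from_count_one)


# Two idle buffers plus two of the three buffers with usage_count == 1.
size = ring_size([0, 0, 1, 1, 1, 5, 3], extra_from_count_one=2)
```

A cache full of frequently used buffers yields a small ring, which is the "vice versa" case described above.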

I'm going to do this incrementally, and we'll see how far we get for 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up Simon's patch (step 1), run some performance tests with vacuum, and submit a patch for that. Then I'll move to Jeff's patch (step 2).


Thoughts? Everyone happy with the roadmap?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
