Simon Riggs wrote:
We need to think of the interaction with partitioning here. People will
ask whether we would recommend that individual partitions of a large
table should be larger/smaller than a particular size, to allow these
optimizations to kick in.

My thinking is that database designers would attempt to set partition
size larger than the sync scan limit, whatever it is. That means:
- they wouldn't want the limit to vary when cache increases, so we *do*
need a GUC to control the limit. My suggestion now would be
large_scan_threshold, since it effects both caching and synch scans.

They wouldn't? If you add more memory to your box, so that a table that didn't fit in memory before does now fit, surely you want to switch your strategy from "don't pollute the cache because it won't fit anyway" to "let's keep it in cache, so the next scan won't do I/O".

The basic problem with partitions is that if you have a query like "SELECT * FROM partitioned_table", so that you seq scan multiple partitions, the size of each partition alone could be below the threshold, whatever that is, but since you're scanning them all the net result is the same as scanning one large table above the threshold. The same could happen in any plan that does multiple seq scans. It could even happen at the application level if the application submits multiple statements like "SELECT * FROM table1", "SELECT * FROM table2" one after each other.

One way to address that would be to manage the recycle buffer ring size dynamically. When a backend gets a cache miss, the ring would shrink, and when you get cache hits, it would grow. That way, if you do a seq scan on a table that fits in cache repeatedly, that table would get more buffers from the cache on each iteration until it's completely in cache. But if the table or tables being scanned are too big to fit in cache, the ring would stay small and not pollute the cache much.

I'm not going to implement that for now, I'm seeing some scary negative feedback behavior with that, and it'd need a lot of testing anyway. I'm thinking of just using shared_buffers as the limit. One could argue for effective_cache_size as well.

- so there will be lots of partitions, so a hardcoded limit of 1000
would not be sufficient. A new GUC, or a link to an existing one, is
probably required.

No matter how many partitions you have, each backend could still be scanning only one of them at a time.

  Heikki Linnakangas

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at


Reply via email to