Luke Lonergan wrote:
> On 3A: In practice, the popular modern OSes (BSD/Linux/Solaris/etc.)
> implement dynamic I/O caching.  Experiments have shown that the benefit
> of re-using the PG buffer cache on large sequential scans is vanishingly
> small when the buffer cache is small compared to system memory.
> Since this is a normal and recommended configuration (the OS I/O cache
> is auto-tuning and easy to administer, etc.), IMO the effort to optimize
> buffer cache reuse for seq scans of tables > 1 x buffer cache is not
> worthwhile.

That's interesting. Care to share the results of the experiments you ran? I was thinking of running tests of my own with varying table sizes.

The main motivation here is to avoid the sudden drop in performance when a table grows large enough that it no longer fits in RAM. See the attached diagram for what I mean. Maybe you're right and the effect isn't that bad in practice.

I'm thinking of attacking 3B first anyway, because it seems much simpler to implement.

> On 3B: The scenario described is "multiple readers seq scanning a large
> table and sharing the bufcache", but in practice this is not a common
> situation.  The common situation is "multiple queries joining several
> small tables to one or more large tables that are >> 1 x bufcache".  In
> the common scenario, the dominant factor is the ability to keep the
> small tables in the bufcache (or I/O cache, for that matter) while
> running the I/O-bound large table scans as fast as possible.

How is that different from what I described?

> To that point - an important factor in achieving the max I/O rate for
> large tables (> 1 x bufcache) is avoiding pollution of the CPU L2
> cache.  The L2 cache is commonly in the range of 512KB - 2MB, and it
> matters here mainly as an upper bound on the size of the ring buffer.
> The effect has been demonstrated to be significant - in the 20%+ range.
> Another thing to consider is the use of readahead inside the heapscan,
> in which case sizes >= 32KB are very effective.

Yeah, I remember the discussion on the L2 cache a while back.

What do you mean by using readahead inside the heapscan? Starting an async read request?
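
If you mean hinting the kernel to prefetch the next few heap blocks ahead of the scan position, here's the kind of thing I could imagine (a sketch only; the function name, the 4-block distance, and the direct use of the file descriptor are all made up for illustration, this is not backend code):

    /* Sketch: hint the kernel to prefetch the next few heap blocks
     * ahead of the current scan position.  All names and constants
     * here are illustrative. */
    #define _POSIX_C_SOURCE 200112L   /* for posix_fadvise() */
    #include <fcntl.h>

    #define BLCKSZ 8192

    static void
    prefetch_ahead(int fd, long next_block, long nblocks)
    {
        long    distance = 4;          /* how far ahead to hint; made up */
        long    first = next_block;
        long    last = first + distance;

        if (last > nblocks)
            last = nblocks;

        if (first < last)
            (void) posix_fadvise(fd,
                                 (off_t) first * BLCKSZ,
                                 (off_t) (last - first) * BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }

Note that posix_fadvise() is only a hint, so it wouldn't block the scan; a true async read request (aio_read() and friends) would be a different beast.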

> The modifications you suggest here may not have the following
> properties:
> - don't pollute the bufcache for seqscans of tables > 1 x bufcache
> - for tables > 1 x bufcache, use a ring buffer for I/O that is ~ 32KB
>   to minimize L2 cache pollution

So the difference is that you don't want the 3A strategy (taking advantage of pages already in the buffer cache) at all, and want the buffer ring strategy to kick in earlier instead. Am I reading you correctly?
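
To make sure we mean the same thing by "ring buffer", here's roughly how I picture it (a simplified sketch with made-up names; in a real patch the ring would recycle shared buffers through the buffer manager rather than private memory, but the recycling idea is the same):

    #define BLCKSZ     8192
    #define RING_SLOTS 4            /* 4 x 8KB = 32KB, sized to stay in L2 */

    typedef struct ScanRing
    {
        char    pages[RING_SLOTS][BLCKSZ];
        int     next;               /* next slot to recycle */
    } ScanRing;

    /*
     * Return the slot to read the next heap page into.  Instead of
     * asking the buffer manager for a victim buffer (and evicting
     * someone else's page), the scan just recycles its own slots
     * round-robin.
     */
    static char *
    ring_next_slot(ScanRing *ring)
    {
        char   *page = ring->pages[ring->next];

        ring->next = (ring->next + 1) % RING_SLOTS;
        return page;
    }

The point being that a scan of a table > 1 x bufcache keeps cycling through these few slots, so it evicts nothing else from the buffer cache and its working set stays small enough to fit in the L2 cache.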

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

<<inline: seqscan-caching.png>>
