> That's interesting. Care to share the results of the 
> experiments you ran? I was thinking of running tests of my 
> own with varying table sizes.

Yeah - it may take a while - you might get there faster.

There are some interesting effects to look at between I/O cache
performance and PG bufcache, and at those speeds the only tool I've
found that actually measures scan rate in PG is VACUUM.  "SELECT
COUNT(*)" measures CPU consumption in the aggregation node, not scan
rate.

Note that the copy from I/O cache to PG bufcache is where the L2 effect
is seen.

> The main motivation here is to avoid the sudden drop in 
> performance when a table grows big enough not to fit in RAM. 
> See attached diagram for what I mean. Maybe you're right and 
> the effect isn't that bad in practice.

There are going to be two performance drops, first when the table
doesn't fit into PG bufcache, the second when it doesn't fit in bufcache
+ I/O cache.  The second is severe, the first is almost insignificant
(for common queries).
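
To make the two thresholds concrete, here is a rough sketch in Python.
The function name and the sizes in the usage lines are made up for
illustration, and it assumes the bufcache and the I/O cache hold
distinct pages so their capacities simply add up.  Nothing here is
measured.

```python
def scan_regime(table_bytes, bufcache_bytes, io_cache_bytes):
    """Classify where a repeated seqscan of a table is served from.

    Hypothetical helper: assumes PG bufcache and OS I/O cache do not
    overlap, so the combined capacity is just the sum of the two.
    """
    if table_bytes <= bufcache_bytes:
        return "bufcache"   # fits in PG bufcache, no drop yet
    if table_bytes <= bufcache_bytes + io_cache_bytes:
        return "io_cache"   # first drop: extra copy from I/O cache
    return "disk"           # second drop: real disk I/O


GB = 1024 ** 3
# Made-up sizes: 1 GB bufcache, 7 GB I/O cache.
for size in (GB // 2, 4 * GB, 16 * GB):
    print(size // (1024 ** 2), "MB ->", scan_regime(size, 1 * GB, 7 * GB))
```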

> How is that different from what I described?

My impression of your descriptions is that they overvalue the case where
there are multiple scanners of a large (> 1x bufcache) table such that
they can share the "first load" of the bufcache, e.g. your 10% benefit
for table = 10x bufcache argument.  I think this is an uncommon
workload; normally there are many small tables and several large
tables such that sharing the PG bufcache is irrelevant to the query

> Yeah I remember the discussion on the L2 cache a while back.
> What do you mean with using readahead inside the heapscan? 
> Starting an async read request?

Nope - just reading N buffers ahead for seqscans.  Subsequent calls use
previously read pages.  The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).
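
Roughly, the idea looks like the sketch below.  This is Python for
illustration only, not the actual heapscan code: scan_pages and the
readahead depth of 16 are assumptions, and 8192 is the default PG
block size.

```python
import io

PAGE_SIZE = 8192  # default PG block size


def scan_pages(f, readahead_pages=16):
    """Yield PAGE_SIZE pages from f, issuing one large read per
    readahead_pages pages instead of one read per page.

    Hypothetical sketch: readahead_pages=16 is an arbitrary choice.
    """
    chunk_size = PAGE_SIZE * readahead_pages
    while True:
        chunk = f.read(chunk_size)  # one contiguous request to the OS
        if not chunk:
            return
        # Subsequent calls are served from the chunk already read.
        for off in range(0, len(chunk), PAGE_SIZE):
            yield chunk[off:off + PAGE_SIZE]


# Usage: a fake 40-page "table" held in memory.
table = io.BytesIO(bytes(PAGE_SIZE * 40))
print(sum(1 for _ in scan_pages(table)), "pages scanned")
```

The caller still sees one page per call; only the size of the request
handed to the OS changes.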

> > The modifications you suggest here may not have the following
> > properties:
> > - don't pollute bufcache for seqscan of tables > 1 x bufcache
> > - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
> > minimize L2 cache pollution
> So the difference is that you don't want 3A (the take 
> advantage of pages already in buffer cache) strategy at all, 
> and want the buffer ring strategy to kick in earlier instead. 
> Am I reading you correctly?

Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).
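
As a minimal sketch of that policy (Python, for illustration; ScanRing
and use_ring are hypothetical names, not existing PG code, and the
32KB default is just the low end of the range above):

```python
PAGE_SIZE = 8192  # default PG block size


class ScanRing:
    """Fixed set of page slots reused round-robin by one seqscan."""

    def __init__(self, ring_bytes=32 * 1024):
        # 32KB / 8KB pages = 4 slots; sized to stay well inside L2.
        self.slots = [None] * (ring_bytes // PAGE_SIZE)
        self.next = 0

    def load(self, page_no, page):
        """Place a page in the next slot, overwriting the oldest one."""
        slot = self.next
        self.slots[slot] = (page_no, page)
        self.next = (slot + 1) % len(self.slots)
        return slot


def use_ring(table_bytes, bufcache_bytes):
    """Use the ring only when the table is > 1 x bufcache."""
    return table_bytes > bufcache_bytes


# Usage: scanning 10 pages never touches more than 4 slots.
ring = ScanRing()
for page_no in range(10):
    ring.load(page_no, bytes(PAGE_SIZE))
print([entry[0] for entry in ring.slots])
```

The scan only ever touches the ring's own slots, so it evicts its own
old pages rather than whatever else is in the bufcache.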

- Luke
