Hi,
On 2025-07-16 14:30:05 -0400, Peter Geoghegan wrote:
> On Wed, Jul 16, 2025 at 2:27 PM Andres Freund <[email protected]> wrote:
> > Could you share the current version of the complex patch (happy with a git
> > tree)? Afaict it hasn't been posted, which makes this pretty hard follow
> > along
> > / provide feedback on, for others.
>
> Sure:
>
> https://github.com/petergeoghegan/postgres/tree/index-prefetch-2025-pg-revisions-v0.11
>
> I think that the version that Tomas must have used is a few days old,
> and might be a tiny bit different. But I don't think that that's
> likely to matter, especially not if you just want to get the general
> idea.
As a first thing I just wanted to get a feel for the improvements we can get.
I had a scale 5 tpch already loaded, so I ran a bogus query on that to see.
The improvement with either of the patchsets with a quick trial query is
rather impressive when using direct IO (presumably also with an empty cache,
but DIO is more predictable).
As Peter's branch doesn't seem to have an enable_* GUC, I used
SET effective_io_concurrency=0 to test the non-prefetching results (and
verified with master that the results are similar).
Test:
Peter's:
Without prefetching:
SET effective_io_concurrency=0;SELECT
pg_buffercache_evict_relation('lineitem');EXPLAIN ANALYZE SELECT * FROM
lineitem ORDER BY l_shipdate LIMIT 10000;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY
PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.611..957.874
rows=10000.00 loops=1) │
│ Buffers: shared hit=1213 read=8626
│
│ I/O Timings: shared read=943.344
│
│ -> Index Scan using i_l_shipdate on lineitem (cost=0.44..6994824.33
rows=29999796 width=106) (actual time=0.611..956.593 rows=10000.00 loops=1) │
│ Index Searches: 1
│
│ Buffers: shared hit=1213 read=8626
│
│ I/O Timings: shared read=943.344
│
│ Planning Time: 0.083 ms
│
│ Execution Time: 958.508 ms
│
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)
With prefetching:
SET effective_io_concurrency=64;SELECT
pg_buffercache_evict_relation('lineitem');EXPLAIN ANALYZE SELECT * FROM
lineitem ORDER BY l_shipdate LIMIT 10000;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY
PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.497..67.737
rows=10000.00 loops=1) │
│ Buffers: shared hit=1227 read=8667
│
│ I/O Timings: shared read=48.473
│
│ -> Index Scan using i_l_shipdate on lineitem (cost=0.44..6994824.33
rows=29999796 width=106) (actual time=0.496..66.471 rows=10000.00 loops=1) │
│ Index Searches: 1
│
│ Buffers: shared hit=1227 read=8667
│
│ I/O Timings: shared read=48.473
│
│ Planning Time: 0.090 ms
│
│ Execution Time: 68.965 ms
│
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)
Tomas':
With prefetching:
SET effective_io_concurrency=64;SELECT
pg_buffercache_evict_relation('lineitem');EXPLAIN ANALYZE SELECT * FROM
lineitem ORDER BY l_shipdate LIMIT 10000;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY
PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.278..70.609
rows=10000.00 loops=1) │
│ Buffers: shared hit=1227 read=8668
│
│ I/O Timings: shared read=52.578
│
│ -> Index Scan using i_l_shipdate on lineitem (cost=0.44..6994824.33
rows=29999796 width=106) (actual time=0.277..69.304 rows=10000.00 loops=1) │
│ Index Searches: 1
│
│ Buffers: shared hit=1227 read=8668
│
│ I/O Timings: shared read=52.578
│
│ Planning Time: 0.072 ms
│
│ Execution Time: 71.549 ms
│
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)
The wins are similar without DIO and a cold OS cache, but i don't like
emptying out the entire OS cache all the time...
I call that a hell of an impressive improvement with either patch - it's
really really hard to find order of magnitude improvements in anything close
to realistic cases.
And that's on a local reasonably fast NVMe - with networked storage we'll see
much bigger wins.
This also doesn't just repro with toy queries, e.g. TPCH Q02 shows a 2X
improvement too (with either patch) - the only reason it's not bigger is that
all the remaining IO time is on the inner side of a nestloop that isn't
currently prefetchable.
Peter, it'd be rather useful if your patch also had an enable/disable GUC,
otherwise it's more work to study the performance effects. The
effective_io_concurrency approach isn't great, because it also affects
bitmap scans, seqscans etc.
Just playing around, there are many cases where there is effectively no
difference between the two approaches, from a runtime perspective. There,
unsurprisingly, are some where the complex approach clearly wins, mostly
around IN(list-of-constants) so far.
Looking at the actual patches now.
Greetings,
Andres Freund