On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra <[email protected]> wrote:
> On 3/1/26 23:32, Alexandre Felipe wrote:
> >
> > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     Hi,
> >
> >     I've decided to run a couple tests, trying to reproduce some of the
> >     behaviors described in your (Felipe's) messages.
> >
> >
> > Thank you,
> > I will look into this data later. I am impressed with the number of IO
> > workers you used, my test was typically with 3.
> >
>
> 3 is extremely low for an I/O bound system. It's our tradition to pick
> defaults that work even on tiny systems, but need tuning on actual
> non-toy systems :-(
>

That was a surprise for me, because I am used to JavaScript, which does
everything in a single process (with a coroutine async model) and does so
with very little overhead.

Cold Cache (buffer eviction before each run):

    io    pf=off   pf=on    d<=16    d<=64    d<=128
     3    0.68s    1.20s    1.29s    0.78s    0.68s
    10    0.75s    1.02s    1.51s    1.62s    0.82s
    30    0.75s    0.79s    2.95s    1.65s    1.43s

Warm Cache (no eviction):

    io    pf=off   pf=on    d<=16    d<=64    d<=128
     3    0.42s    0.67s    0.40s    0.41s    0.41s
    10    0.53s    0.96s    0.39s    0.38s    0.44s
    30    0.39s    0.76s    0.39s    0.39s    0.40s

> Sure, but even then it'd be better to have more details about the
> hardware. "M1" doesn't really say much (especially to people who don't
> use Apple stuff very much). There's a range of M1-based systems, from
> MacBook Air/Pro, Mini, Studio, ...

I tried to find out some details of the hardware. It shows that I didn't
even know the (marketing) model name...

Chip: Apple M2, 8 (physical) cores, 24 GB RAM, macOS Tahoe 26.1
Apple SSD AP2048Z (2 TB NVMe), block size 4 kB, protocol: Apple Fabric

> >     Well, I was hoping to be able to create a self balancing mechanism
> >     in read_stream_next_buffer
> >
> >     /* Do we have to wait for an associated I/O first?
> >      */
> >     if (stream->ios_in_progress > 0 &&
> >         stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
> >     {
> >         // prefetch and increase the distance while we wait here
> >         WaitReadBuffers(&stream->ios[io_index].op);
> >         ...
> >     }
> >     ...
> >     // this call could be removed if we prefetched earlier.
> >     read_stream_look_ahead(stream);
> >
> >
> > The same principle that guided the
> >> Don't wait for already in-progress IO
> > patch. Here we should prioritise increasing the distance, and if that
> > is not possible (maybe we consumed all the buffers), we could take the
> > opportunity to yield.
> >
>
> IIUC the idea would be to (automatically) increase the distance just
> enough so that the IOs complete right before we actually need the
> buffers? That reminds me the "IO cost model" I mentioned a couple days
> ago [1]. But it's not clear to me how this profiling helps with it.
>

Exactly, actually my investigation started from your (Tomas) suggestion,
more specifically, this part:

> I think a "proper" solution would require some sort of cost model for
> the I/O part, so that we can schedule the I/Os just so that the I/O
> completes right before we actually need the page.

We are controlling the distance because we can, even knowing that it is
not really what we need. We need to prefetch enough to make sure the
buffers will be ready when we need them, but on the other hand we need to
minimise reads of buffers we don't need.

It is clear that reading unnecessary blocks can degrade other queries in
two ways: (1) delaying concurrent tasks that require I/O, be it queries,
maintenance, or replication; (2) hurting future queries by unnecessarily
evicting buffers. If the system runs queries reading on average N pages,
and we read just one additional page, we get 1/N overhead. For N=100 this
is negligible; for N<=2 it is not.

Yielding is important, but it is also important that the executor yields
to the prefetch again before the distance gets too low.
DISTANCE INCREASE

The current distance increase mechanism captures the idea of _completing
right before we actually need the page_ by increasing the distance when a
wait is required (though, reflecting on it, I am not sure that condition
is free of false positives).

I think that increasing exponentially is too aggressive. If you waited
for 10 buffers in a row you will have a distance of ~1000 (a potential
waste of 100x). Wouldn't it make more sense to increase the distance in
increments based on the I/O combine limit and I/O concurrency? That way
the waste is limited by a constant factor of the required buffers.

DISTANCE DECREASE

Also, the interpretation of the distance going up is different from the
interpretation of it going down. When it goes up it is an upper bound for
the distance; when it goes down it tracks the actual distance. This could
explain Andres Freund's observation [1]:

> It seems caused to a significant degree by waiting at low queue depths. If I
> comment out the stream->distance-- in read_stream_start_pending_read() the
> regression is reduced greatly.

CLOSED LOOP THOUGHTS

Here I dump some ideas that we could work on, but again, this is
something that could go in a separate commit, without blocking this
feature.

Let D be the distance that makes sure a given scan has its buffers ready
before consuming them. For a long enough scan we can find d >= D. Label
each I/O with the distance (d_io) at which it was started. Every time we
wait for a buffer, we know that d_io < D (by definition), so set
d = max(d, d_io).

Notice that the current model increments the distance based on its
current value, not the distance when the I/O was started; this gives room
for even more overshoot. But the prefetch is not immediate, which is
probably damping it, and I speculate that the distance decay was added as
a workaround for this.
On the other hand, decreasing the distance when we consume a buffer
brings us back to risky ground, below the value that was previously seen
to be necessary to ensure no waits. Maybe this decay was introduced just
to counteract the excessive increase caused by feeding back the current
distance instead of the distance when the operation was started.

The problem with increasing exponentially is that we may have a long gap
between the computed distance and the actual required distance. By
increasing it in steps based on the maximum I/O concurrency we would
still start as much I/O as could be useful, while keeping the gap between
the distance and the optimal distance small. It could even make the start
slightly faster, by increasing the concurrency on the initial reads where
the distance would otherwise be 1, 2, 4, 8.

TIME AGNOSTIC MODEL

A proper model would distinguish the distance at which an I/O is started,
the distance at which it finishes, and the distance we are aiming for.

    start read:
        io.distance = number of pinned buffers before this
    loop:
        if (have to wait)
            target_distance = io.distance + concurrency;
        /* start prefetch */
        if (distance <= min_distance)
            distance += prefetch(concurrency, combine_limit)
        if (buffer consumed)
            distance--

TIME AWARE MODEL

The I/O model could take delays into consideration. I know that timing
has a cost, but hopefully it would be acceptable to record the timestamp
of an I/O when it starts. Then we could build a model based on an
approximation of some quantile of the latency. If we know e.g. that 50%
of the I/Os take longer than T(50%), and we time the first few buffer
consumptions to get a guess, say, tc, the average time per buffer
required for this scan on this query, then T(50%) / tc is a distance that
puts us within a 50% chance of requiring a wait.
But in order for this model to be effective we should keep it global; if
we make it per query, by the time we have tuned the parameters the query
has already gone too far. It could be saved in the index's stats, along
with the number of scans, tuples read, and heap fetches.

A simpler, local version could be based on the time of the first I/O, but
how do we time it? Currently we don't have a callback to register when it
is ready. Let's say that we time it just after WaitReadBuffers. A pseudo
code draft, that might be clearer than what I said:

    init:
        t_start = now();
    loop:
        buffer_count++;
        if (has to wait)
        {
            t_wait = now();
            t_per_buffer = (t_wait - t_start) / buffer_count;
            // is this wait blocking other scans??
            WaitReadBuffer(io);
            t_ready = now();
            t_io = t_ready - io.t_start;
            distance_lower_bound = Max(distance_min, io.distance);
            distance_guess = Max(distance_guess, t_io / t_per_buffer);
            t_start = t_ready;
            buffer_count = 0;
        }

We have the timing overhead, but notice that this is independent of the
time unit or the epoch; some CPU tick count could be used as a cheaper
alternative. And we may do this just a few times.

The distance lower bound would limit the distance decay: we already know
that we had to wait for an I/O started at io.distance, so let's stay
above that (unless we approach the scan limit).

I hope this makes sense.

> I wrote about the signal overhead a couple months ago, with a simple
> benchmark simulating it [.]. I also wrote a brief explanation [.] about
> the AIO in PG18, which mentions that too.
>

Added to pending reads :)

[1] https://www.postgresql.org/message-id/qdl4fojnbfcnm2k7b4zpvgd6gwzwdgtbl5c7shpimrb76dbyy6%40scdnspus3ejh
