Hi,
On 2026-03-02 14:18:31 +0100, Tomas Vondra wrote:
> On 3/2/26 10:00, Alexandre Felipe wrote:
> >
> >
> > On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> > On 3/1/26 23:32, Alexandre Felipe wrote:
> > >
> > > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <[email protected]
> > <mailto:[email protected]>
> > > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> > >
> > > Hi,
> > >
> > > I've decided to run a couple tests, trying to reproduce some
> > of the
> > > behaviors described in your (Felipe's) messages.
> > >
> > >
> > > Thank you,
> > > I will look into this data later. I am impressed with the number of IO
> > > workers
> > > you used, my test was typically with 3.
> > >
> >
> > 3 is extremely low for an I/O bound system. It's our tradition to pick
> > defaults that work even on tiny systems, but need tuning on actual
> > non-toy systems :-(
> >
> >
> > That was a surprise to me, because I am used to JavaScript,
> > which does everything in a single process (with a coroutine-based
> > async model) and with very little overhead.
> >
>
> Well, we don't have coroutines (or other threads). If you want something
> similarly lightweight, you need to use io_uring. Which means you need to
> be on Linux, unfortunately.
Even if we had coroutines, there's no portable way of integrating them with
file IO. For that you need either something like io_uring, IOCP or posix_aio
(although the latter performs so terribly on most platforms, including macOS,
that it's not worth using).
> > This could explain Andres Freund's observation [1]
> >
> >> It seems caused to a significant degree by waiting at low queue
> > depths. If I
> >> comment out the stream->distance-- in read_stream_start_pending_read() the
> >> regression is reduced greatly.
> >
>
> Yes, because the distance heuristic is based solely on buffer hits and
> misses (and not on some direct feedback if we're prefetching far enough).
>
> For any algorithm that increases/decreases the distance (e.g. the
> current distance*2 and distance--) there are data patterns that
> collapse to 1 too early. It just takes the right fraction of hits and
> misses. (If we ignore "trivial" heuristics that never decrease the
> distance etc.)
I think this makes the problem sound harder than it needs to be for a
significant improvement:
Right now an IO pattern like [(miss, hit)+] will oscillate between distance =
0 and 1, and thus will do all IO synchronously. I.e. every second required
buffer will require a synchronous read - you'll get terrible IO
performance.
This is also true for any pattern that has more hits, but not more misses in
a row - those are extremely common patterns.
If you instead make the distance increase logic something like:
distance = Max(4, distance * 2);
and have the decrease logic be something like
	/*
	 * Only decrease distance if there are no IOs in the queue. As long as we
	 * are asynchronously executing IO we are benefiting from the higher
	 * distance.
	 */
	if (stream->ios_in_progress == 0 && stream->distance > 1)
		stream->distance--;
You need a more adverse pattern to not get any readahead. Even if you look at
the most trivial adverse pattern ([(miss, hits{3})+]), you've reduced the
number of synchronous waits by 2x compared to the current worst case pattern.
And if you ever have two misses separated by fewer than three hits in a row,
the distance will much more often not be decreased by the hits (because
there's still IO in progress when encountering a hit) and the distance will
increase further (because we need to wait for IO).
Another thing we probably ought to do is perform lookahead when the
fast path encounters a miss (rather than doing so *after* waiting for the
IO), so that we at least amortize the cost of having to wait across multiple
IOs.
> IMHO the simplest solution with the current heuristics would be to not
> allow the distance to "drop" below a "small" value that is high enough
> to hide the AIO overhead.
Unfortunately that has too big a performance penalty for fully cached
workloads :(. Doing buffer mapping lookups ahead of the current point is not
free.
We could probably do something much more aggressive than what I described
above without meaningful negative impacts, substantially reducing the maximum
negative impact of decreasing the distance. E.g. a separate cool-off counter
that needs to be decremented before the distance can be decreased after a
miss. If we e.g. only allowed the distance to decrease after 32 buffer hits,
at a time when no IO is in progress, the worst-case number of waits would be
a lot lower than today.
Greetings,
Andres Freund