Hi,
On 2026-03-02 14:18:31 +0100, Tomas Vondra wrote:
> On 3/2/26 10:00, Alexandre Felipe wrote:
> >
> >
> > On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> > On 3/1/26 23:32, Alexandre Felipe wrote:
> > >
> > > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <[email protected]
> > <mailto:[email protected]>
> > > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> > >
> > > Hi,
> > >
> > > I've decided to run a couple tests, trying to reproduce some
> > of the
> > > behaviors described in your (Felipe's) messages.
> > >
> > >
> > > Thank you,
> > > I will look into this data later. I am impressed with the number of IO
> > > workers
> > > you used, my test was typically with 3.
> > >
> >
> > 3 is extremely low for an I/O bound system. It's our tradition to pick
> > defaults that work even on tiny systems, but need tuning on actual
> > non-toy systems :-(
> >
> >
> > That was a surprise to me, because I am used to JavaScript,
> > which does everything in a single process (with a coroutine-based
> > async model) and with very little overhead.
> >
>
> Well, we don't have coroutines (or other threads). If you want something
> similarly lightweight, you need to use io_uring. Which means you need to
> be on Linux, unfortunately.
Even if we had coroutines, there's no portable way of integrating them with
file IO. For that you need either something like io_uring, IOCP or posix_aio
(although the latter performs so terribly on most platforms, including macOS,
that it's not worth using).
> > This could explain Andres Freund's observation [1]
> >
> >> It seems caused to a significant degree by waiting at low queue
> > depths. If I
> >> comment out the stream->distance-- in read_stream_start_pending_read() the
> >> regression is reduced greatly.
> >
>
> Yes, because the distance heuristic is based solely on buffer hits and
> misses (and not on some direct feedback if we're prefetching far enough).
>
> For any algorithm that increases/decreases the distance (e.g. the
> current distance*2 and distance--) there are data patterns that
> collapse to 1 too early. It just takes the right fraction of hits and
> misses. (If we ignore "trivial" heuristics that never decrease the
> distance etc.)
I think this makes the problem sound harder than it needs to be for a
significant improvement:
Right now an IO pattern like [(miss, hit)+] will oscillate between distance =
0 and 1, and thus will do all IO synchronously. I.e. every second required
buffer will require a synchronous read - you'll get terrible IO
performance.
This is also true for any pattern that has more hits, but not more misses in
a row - those are extremely common patterns.
If you instead make the distance increase logic something like:
distance = Max(4, distance * 2);
and have the decrease logic be something like
	/*
	 * Only decrease distance if there are no IOs in the queue. As long as we
	 * are asynchronously executing IO we are benefiting from the higher
	 * distance.
	 */
	if (stream->ios_in_progress == 0 && stream->distance > 1)
		stream->distance--;
You need a more adverse pattern to not get any readahead. Even if you look at
the most trivial adverse pattern ([(miss, hits{3})+]), you've reduced the
number of synchronous waits by 2x compared to the current worst case pattern.
And if you ever have two misses separated by fewer than three hits in a row,
the distance will much more often not be decreased by the hits (because
there's still IO in progress when encountering a hit) and the distance will
increase further (because we need to wait for IO).
Another thing we probably ought to do is perform lookahead when the
fast path encounters a miss (rather than doing so *after* waiting for the
IO), so that we at least amortize the cost of having to wait across multiple
IOs.
> IMHO the simplest solution with the current heuristics would be to not
> allow the distance to "drop" below a "small" value that is high enough
> to hide the AIO overhead.
Unfortunately that has too big a performance penalty for fully cached
workloads :(. Doing buffer mapping lookups ahead of the current point is not
free.
We could probably do something much more aggressive than what I described
above without meaningful negative impacts, substantially reducing the maximum
negative impact of decreasing the distance. E.g. a separate cool-off counter
that needs to be decremented before the distance can be decreased after a
miss. If we e.g. only allowed the distance to decrease after 32 buffer hits,
at a time when no IO is in progress, the worst-case number of waits would be
a lot lower than today.
Greetings,
Andres Freund