On 3/2/26 10:00, Alexandre Felipe wrote:
> 
> 
> On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra
> <[email protected]> wrote:
> 
>     On 3/1/26 23:32, Alexandre Felipe wrote:
>     >
>     > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra
>     > <[email protected]> wrote:
>     >
>     >     Hi,
>     >
>     >     I've decided to run a couple tests, trying to reproduce some
>     of the
>     >     behaviors described in your (Felipe's) messages.
>     >
>     >
>     > Thank you,
>     > I will look into this data later. I am impressed with the number of IO
>     > workers 
>     > you used, my test was typically with 3.
>     >
> 
>     3 is extremely low for an I/O bound system. It's our tradition to pick
>     defaults that work even on tiny systems, but need tuning on actual
>     non-toy systems :-(
> 
> 
> That was a surprise for me, because I am used to JavaScript,
> which does everything in one single process (with a coroutine
> async model) and with very little overhead.
> 

Well, we don't have coroutines (or other threads). If you want something
similarly lightweight, you need to use io_uring. Which means you need to
be on Linux, unfortunately.

> Cold Cache (buffer eviction before each run):
> io  pf=off  pf=on  d<=16  d<=64  d<=128
>  3   0.68s  1.20s  1.29s  0.78s   0.68s
> 10   0.75s  1.02s  1.51s  1.62s   0.82s
> 30   0.75s  0.79s  2.95s  1.65s   1.43s
> 
> Warm Cache (no eviction):
> io  pf=off  pf=on  d<=16  d<=64  d<=128
>  3   0.42s  0.67s  0.40s  0.41s   0.41s
> 10   0.53s  0.96s  0.39s  0.38s   0.44s
> 30   0.39s  0.76s  0.39s  0.39s   0.40s
> 

I don't know what these numbers are / from what test.

>
>     Sure, but even then it'd be better to have more details about the
>     hardware. "M1" doesn't really say much (especially to people who don't
>     use Apple stuff very much). There's a range of M1-based systems, from
>     MacBook Air/Pro, Mini, Studio, ...
> 
>  
> I tried to find out some details of the hardware. It turns out I didn't
> even know the (marketing) model name...
> 
> Chip: Apple M2, 8 (physical) cores, 24 GB, macOS Tahoe 26.1 
> Apple SSD AP2048Z (2 TB NVMe), block size 4kb, protocol: Apple Fabric
> 
OK

>     >
>     > Well, I was hoping to be able to create a self balancing mechanism
>     > in read_stream_next_buffer
>     >
>     >  /* Do we have to wait for an associated I/O first? */
>     > if (stream->ios_in_progress > 0 &&
>     > stream->ios[stream->oldest_io_index].buffer_index ==
>     oldest_buffer_index)
>     > {
>     >   // prefetch and increase the distance while we wait here
>     > WaitReadBuffers(&stream->ios[io_index].op);
>     >  ...
>     > }
>     > ...
>     > // this call could be removed if we prefetched earlier.
>     > read_stream_look_ahead(stream);
>     >
>     >
>     > The same principle guided the
>     >> Don't wait for already in-progress IO
>     > patch. Here we should prioritise increasing the distance, and if
>     > that is not possible (maybe we consumed all the buffers), we could
>     > take the opportunity to yield.
>     >
>     >
> 
>     IIUC the idea would be to (automatically) increase the distance just
>     enough so that the IOs complete right before we actually need the
>     buffers? That reminds me of the "IO cost model" I mentioned a couple
>     days ago [1]. But it's not clear to me how this profiling helps with it.
> 
> 
> Exactly, actually my investigation started from your (Tomas)
> suggestion, more specifically, this part:
> 
>> I think a "proper" solution would require some sort of cost model for
>> the I/O part, so that we can schedule the I/Os just so that the I/O
>> completes right before we actually need the page.
> 
> We are controlling the distance because we can, even knowing
> that it is not what we need. We need to prefetch enough
> to make sure the buffers will be ready when we need them, but
> on the other hand we need to minimise reads of buffers
> we don't need.
> 

OK, but how does the instrumentation help with this?

> It is clear that reading unnecessary blocks could lead to 
> degradation of other queries in two ways. (1) delaying
> concurrent tasks that require I/O, be it queries, maintenance,
> replication. (2) future queries, by unnecessarily evicting
> buffers.

Agreed on (1). This would be a problem e.g. if a huge query "fills" the
IO queue with requests, and small queries that only need a couple blocks
have to wait much longer.

I don't think (2) would be a huge problem, though. It's extremely timing
sensitive - if the second query runs a second later, the blocks will be
gone anyway. If your "cache hit ratio" depends on this, you'll have a
bad time. It's also possible the second query will use the new data, in
which case it's better to evict sooner.

> If the system has queries reading on average N pages, and
> we read just one additional page, we get 1/N overhead. For
> N=100 this is negligible; for N<=2 it is not.
> 

Isn't it wasted only if you don't actually use the page later? Which for
most queries is not the case - the query will access the page, no matter
how early it prefetches it.

> Yielding is important, but it is also important that the
> executor yield to the prefetch again before the distance
> is too low.
> 

That's up to the read stream, and it will try to prefetch again on the
next buffer read. There's not much this patch can do.

> DISTANCE INCREASE
> 
> The current distance increase mechanism captures the idea of
> _completing right before we actually need the page_ by
> increasing the distance when a wait is required (well, reflecting
> on it, I am not sure that condition is free of false
> positives). I think that increasing exponentially is too
> aggressive. If we waited for 10 buffers in a row we will have
> a distance of ~1000 (a potential waste of 100x).
> Wouldn't it make more sense to increase the distance in
> increments based on the I/O combine limit and I/O concurrency?
> That way the waste is limited by a constant factor of the
> required buffers.
> 

I don't follow. The current distance heuristic does not consider this
at all. It only looks at buffer hits and misses (before issuing the IO),
which is an entirely different thing.

It would need to check if the page is already in shared buffers in
read_stream_next_buffer(), or if it needs to wait for the I/O to
complete. And it does not do that.

I also think you're missing the difference between the 'distance' and
effective_io_concurrency. The 'distance' does not determine how many
blocks we request in advance, it determines how far ahead we're allowed
to look for I/O. The number of concurrent I/O is still limited by
effective_io_concurrency, which is 16 by default. So maybe you have
distance=1000, but if the next 16 blocks are not in shared buffers,
you'll only submit IO for the next 16 blocks.

> DISTANCE DECREASE
> 
> Also the interpretation of the distance going up is different
> from the interpretation of the distance going down.
> When it goes up it is an upper bound for the distance;
> when it goes down it tracks the actual distance.
> In this sense I agree with something written earlier in the thread.
> 
> This could explain Andres Freund's observation [1]:
> 
>> It seems caused to a significant degree by waiting at low queue
>> depths. If I comment out the stream->distance-- in
>> read_stream_start_pending_read() the regression is reduced greatly.
> 

Yes, because the distance heuristic is based solely on buffer hits and
misses (and not on some direct feedback on whether we're prefetching far
enough).

For any algorithm that increases/decreases the distance (e.g. the
current distance*2 and distance--) there are data patterns that
collapse it to 1 too early. It just takes the right fraction of hits and
misses. (If we ignore "trivial" heuristics that never decrease the
distance, etc.)

And once you get to distance=1, you're doing a synchronous I/O with the
AIO overhead (signaling workers etc.). Which is what Andres described.

IMHO the simplest solution with the current heuristics would be to not
allow the distance to "drop" below a "small" value that is high enough
to hide the AIO overhead.

> 
> CLOSED LOOP THOUGHTS
> 
> Here I dump some ideas that we could work on, but again,
> this is something that could go in a separate commit, without
> blocking this feature.
> 

There's no way we can include this in the current commit. The patch is
already large, and it's a far too large change, too late in the cycle.

> Let D be a distance that makes sure a given scan
> has the buffers ready before consuming them. For a long enough scan
> we can find d >= D.
> 
> Label each I/O with the distance (d_io) at which it was started.
> Every time we wait for a buffer, we know that d_io < D (by definition),
> so set d = max(d, d_io). Notice that the current model increases the
> distance based on its current value, not the distance when the I/O
> started; this gives room for even more overshoot. But the prefetch
> is not immediate, which probably dampens it, and I speculate that
> the distance decay was added as a workaround for this.
> 
> 
> On the other hand, decreasing the distance when we consume a buffer
> brings us again to risky ground, below the value that was
> previously seen to be necessary to ensure no waits. Maybe this
> decay was introduced just to counteract the excessive increase
> due to feedback on the current distance instead of the distance
> when the operation started.
> 

The decay is there because the hit/miss ratio can vary over time (in the
same scan), and there's no need to prefetch very far when everything is
in cache.

> The problem with increasing exponentially is that we may have a long
> gap between the computed distance and the actual required distance.
> Increasing it in steps based on the maximum I/O concurrency would
> still start as much I/O as could be useful, while keeping the gap
> between the distance and the optimal distance small. It could also
> make the start even faster (just a little bit), by increasing the
> concurrency on the initial reads, where the distance would otherwise
> be 1, 2, 4, 8.
> 

I don't follow. Why would an exponential distance increase have a long
gap, but increasing it by fixed steps would not?

> TIME AGNOSTIC MODEL
> 
> A proper model would distinguish the distance when an I/O is
> started, the distance when it finishes, and the distance that
> we are aiming for.
> 
> start read:
>   save io.distance = number of buffers pinned before this one
> loop:
>   if (have to wait)
>     target_distance = io.distance + concurrency
>     start prefetch
>   if (distance <= min_distance)
>     distance += prefetch(concurrency, combine_limit)
>   if (buffer consumed)
>     distance--
> 
> TIME AWARE MODEL
> 
> The IO model could take delays into consideration. I know that
> timing has a cost, but hopefully it would be acceptable
> to record a timestamp when an IO starts. Then we could
> build a model based on an approximation of some quantile of the
> IO time. If we know that e.g. 50% of the IOs take longer
> than T(50%), then we could time the first few buffer consumptions
> to get a guess, say tc, the average time per buffer
> required for this scan on this query. Then T(50%) / tc is a
> distance that puts us within a 50% chance of having to wait.
> 
> But in order for this model to be effective we should keep it
> global; if we make it per-query, by the time we have tuned the
> parameters the query has already gone too far. It could be saved
> in the index stats, along with the number of scans, tuples, and
> heap fetches.
> 
> A simpler, local version could be based on the time of the first
> IO, but how do we time it? Currently we don't have a callback
> to register when it is ready. Let's say that we time it just
> after the WaitReadBuffers.
> 
> A pseudo-code draft that might be clearer than what I said:
> 
> init:
>   t_start = now();
> loop:
>   buffer_count++;
>   if (has to wait) {
>     t_wait = now();
>     t_per_buffer = (t_wait - t_start) / buffer_count;
>     // is this wait blocking other scans?
>     WaitReadBuffer(io);
>     t_ready = now();
>     t_io = t_ready - io.t_start;  // io.t_start saved when the IO was issued
>     distance_lower_bound = Max(distance_min, io.distance);
>     distance_guess = Max(distance_guess, t_io / t_per_buffer);
>     t_start = t_ready;
>     buffer_count = 0;
>   }
> 
> We have the timing overhead, but notice that this is independent
> of the time unit or the epoch; some CPU tick count could be used
> as a cheaper alternative. And we may need to do this only a few times.
> 
> The distance lower bound would limit the distance decay: we already
> know that we had to wait for an IO started at io.distance, so
> let's stay above that (unless we approach the scan limit).
> 
> I hope this makes sense.
> 
Perhaps. Some feedback like this might work, but I have no opinion on
which of those approaches is the best one. I'd have to see some PoC
patches, etc.


regards

-- 
Tomas Vondra


