On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra <[email protected]> wrote:
> On 3/1/26 23:32, Alexandre Felipe wrote:
> >
> > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     Hi,
> >
> >     I've decided to run a couple tests, trying to reproduce some of the
> >     behaviors described in your (Felipe's) messages.
> >
> >
> > Thank you,
> > I will look into this data later. I am impressed with the number of IO
> > workers you used, my test was typically with 3.
> >
>
> 3 is extremely low for an I/O bound system. It's our tradition to pick
> defaults that work even on tiny systems, but need tuning on actual
> non-toy systems :-(
>

That was a surprise for me, because I am used to JavaScript, which does
everything in a single process (with a coroutine async model) and does so
with very little overhead.

Cold Cache (buffer eviction before each run):

    io    pf=off   pf=on    d<=16    d<=64    d<=128
     3    0.68s    1.20s    1.29s    0.78s    0.68s
    10    0.75s    1.02s    1.51s    1.62s    0.82s
    30    0.75s    0.79s    2.95s    1.65s    1.43s

Warm Cache (no eviction):

    io    pf=off   pf=on    d<=16    d<=64    d<=128
     3    0.42s    0.67s    0.40s    0.41s    0.41s
    10    0.53s    0.96s    0.39s    0.38s    0.44s
    30    0.39s    0.76s    0.39s    0.39s    0.40s

> Sure, but even then it'd be better to have more details about the
> hardware. "M1" doesn't really say much (especially to people who don't
> use Apple stuff very much). There's a range of M1-based systems, from
> MacBook Air/Pro, Mini, Studio, ...

I tried to find out some details of the hardware. It shows that I didn't
even know the (marketing) model name...

Chip: Apple M2, 8 (physical) cores, 24 GB RAM, macOS Tahoe 26.1
Apple SSD AP2048Z (2 TB NVMe), block size 4 kB, protocol: Apple Fabric

> >     Well, I was hoping to be able to create a self balancing mechanism
> >     in read_stream_next_buffer
> >
> >     /* Do we have to wait for an associated I/O first?
> >      */
> >     if (stream->ios_in_progress > 0 &&
> >         stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
> >     {
> >         // prefetch and increase the distance while we wait here
> >         WaitReadBuffers(&stream->ios[io_index].op);
> >         ...
> >     }
> >     ...
> >     // this call could be removed if we prefetched earlier.
> >     read_stream_look_ahead(stream);
> >
> >
> > The same principle that guided the
> >> Don't wait for already in-progress IO
> > patch. Here we should prioritise increasing the distance, and if that
> > is not possible (maybe we consumed all the buffers), we could take the
> > opportunity to yield.
> >
>
> IIUC the idea would be to (automatically) increase the distance just
> enough so that the IOs complete right before we actually need the
> buffers? That reminds me the "IO cost model" I mentioned a couple days
> ago [1]. But it's not clear to me how this profiling helps with it.
>

Exactly, actually my investigation started from your (Tomas) suggestion,
more specifically, this part:

> I think a "proper" solution would require some sort of cost model for
> the I/O part, so that we can schedule the I/Os just so that the I/O
> completes right before we actually need the page.

We are controlling the distance because we can, even knowing that it is
not really what we need. We need to prefetch enough to make sure the
buffers will be ready when we need them, but on the other hand we need to
minimise reads of buffers we don't need.

It is clear that reading unnecessary blocks can degrade other queries in
two ways: (1) delaying concurrent tasks that require I/O, be it queries,
maintenance, or replication; (2) hurting future queries by unnecessarily
evicting buffers. If the system runs queries reading on average N pages,
and we read just one additional page, we get 1/N overhead. For N=100 this
is negligible; for N<=2 it is not.

Yielding is important, but it is also important that the executor yields
to the prefetch again before the distance gets too low.
DISTANCE INCREASE

The current distance increase mechanism captures the idea of _completing
right before we actually need the page_ by increasing the distance when a
wait is required (though, reflecting on it, I am not sure that condition
is free of false positives).

I think that increasing exponentially is too aggressive. If you waited
for 10 buffers in a row you will have a distance of ~1000 (a potential
waste of 100x). Wouldn't it make more sense to increase the distance in
increments based on the I/O combine limit and I/O concurrency? That way
the waste is limited by a constant factor of the required buffers.

DISTANCE DECREASE

Also, the interpretation of the distance going up is different from the
interpretation of it going down. When it goes up it is an upper bound for
the distance; when it goes down it tracks the actual distance. This could
explain Andres Freund's observation [1]:

> It seems caused to a significant degree by waiting at low queue depths. If I
> comment out the stream->distance-- in read_stream_start_pending_read() the
> regression is reduced greatly.

CLOSED LOOP THOUGHTS

Here I dump some ideas that we could work on, but again, this is
something that could go in a separate commit, without blocking this
feature.

Let D be the distance that makes sure a given scan has its buffers ready
before consuming them. For a long enough scan we can find d >= D. Label
each I/O with the distance (d_io) at which it was started. Every time we
wait for a buffer, we know that d_io < D (by definition), so set
d = max(d, d_io).

Notice that the current model increments the distance based on its
current value, not the distance when the I/O was started; this gives room
for even more overshoot. But the prefetch is not immediate, which is
probably damping it, and I speculate that the distance decay was added as
a workaround for this.
On the other hand, decreasing the distance when we consume a buffer
brings us back to risky ground, below the value that was previously seen
to be necessary to ensure no waits. Maybe this decay was introduced just
to counteract the excessive increase caused by feeding back the current
distance instead of the distance when the operation was started.

The problem with increasing exponentially is that we may have a long gap
between the computed distance and the actual required distance. By
increasing it in steps based on the maximum I/O concurrency we would
still start as much I/O as could be useful, while keeping the gap between
the distance and the optimal distance small. It could even make the start
slightly faster, by increasing the concurrency on the initial reads where
the distance would otherwise be 1, 2, 4, 8.

TIME AGNOSTIC MODEL

A proper model would distinguish the distance at which an I/O is started,
the distance at which it finishes, and the distance we are aiming for.

    start read:
        io.distance = number of pinned buffers before this
    loop:
        if (have to wait)
            target_distance = io.distance + concurrency;
        /* start prefetch */
        if (distance <= min_distance)
            distance += prefetch(concurrency, combine_limit)
        if (buffer consumed)
            distance--

TIME AWARE MODEL

The I/O model could take delays into consideration. I know that timing
has a cost, but hopefully it would be acceptable to record the timestamp
of an I/O when it starts. Then we could build a model based on an
approximation of some quantile of the latency. If we know e.g. that 50%
of the I/Os take longer than T(50%), and we time the first few buffer
consumptions to get a guess, say, tc, the average time per buffer
required for this scan on this query, then T(50%) / tc is a distance that
puts us within a 50% chance of requiring a wait.
But in order for this model to be effective we should keep it global; if
we make it per query, by the time we have tuned the parameters the query
has already gone too far. It could be saved in the index's stats, along
with the number of scans, tuples read, and heap fetches.

A simpler, local version could be based on the time of the first I/O, but
how do we time it? Currently we don't have a callback to register when it
is ready. Let's say that we time it just after WaitReadBuffers. A pseudo
code draft, that might be clearer than what I said:

    init:
        t_start = now();
    loop:
        buffer_count++;
        if (has to wait)
        {
            t_wait = now();
            t_per_buffer = (t_wait - t_start) / buffer_count;
            // is this wait blocking other scans??
            WaitReadBuffer(io);
            t_ready = now();
            t_io = t_ready - io.t_start;
            distance_lower_bound = Max(distance_min, io.distance);
            distance_guess = Max(distance_guess, t_io / t_per_buffer);
            t_start = t_ready;
            buffer_count = 0;
        }

We have the timing overhead, but notice that this is independent of the
time unit or the epoch; some CPU tick count could be used as a cheaper
alternative. And we may do this just a few times.

The distance lower bound would limit the distance decay: we already know
that we had to wait for an I/O started at io.distance, so let's stay
above that (unless we approach the scan limit).

I hope this makes sense.

> I wrote about the signal overhead a couple months ago, with a simple
> benchmark simulating it [.]. I also wrote a brief explanation [.] about
> the AIO in PG18, which mentions that too.
>

Added to pending reads :)

[1] https://www.postgresql.org/message-id/qdl4fojnbfcnm2k7b4zpvgd6gwzwdgtbl5c7shpimrb76dbyy6%40scdnspus3ejh
