On 8/20/25 00:27, Peter Geoghegan wrote:
> On Tue, Aug 19, 2025 at 2:22 PM Peter Geoghegan <p...@bowt.ie> wrote:
>> That definitely seems like a problem. I think that you're saying that
>> this problem happens because we have extra buffer hits earlier on,
>> which is enough to completely change the ramp-up behavior. This seems
>> to be all it takes to dramatically decrease the effectiveness of
>> prefetching. Does that summary sound correct?
>
That summary is correct, yes.

I kept thinking about this, while looking at more regressions found by my script (which generates data sets with different data distributions, etc.). Almost all regressions (at least the top ones) now look like this, i.e. the distance collapses to ~2.0, which essentially disables prefetching.

But I no longer think it's caused by the "priorbatch" optimization, which delays read stream creation until after the first batch. I still think we may need to rethink that (e.g. if the first batch is huge), but the distance can "collapse" even without it. The optimization just makes it easier to happen.

AFAICS the distance collapse is "inherent" to how the distance gets increased/decreased after hits/misses. Let's start with distance=1, and let's assume 50% of buffers are hits, in a regular pattern - hit-miss-hit-miss-hit-miss-...

In this case, the distance will never increase beyond 2, because we'll double-decrement-double-decrement-... so it'll flip between 1 and 2, no matter how you set effective_io_concurrency.

Of course, this can happen with other hit ratios too; there's nothing special about 50%. With fewer hits, it's fine - there's an asymmetry, because the distance grows by doubling and decreases by decrementing 1. So once we have a bit more misses, it keeps growing. But with more hits, the hit/miss ratio simply determines the "stable" distance. Let's say there's 80% hits, so 4 hits to 1 miss. Then the stable distance is ~4, because we get a miss, double to 8, and then 4 hits, so the distance drops back to 4. And again. Similarly for other hit/miss ratios (it's easier to think about if you keep the number of hits 2^n).

It's worth noting that effective_io_concurrency has almost no impact on what distance we end up with; it merely limits the maximum distance.

I find this distance heuristic a bit strange, for a couple of reasons:

* It doesn't seem right to get stuck at distance=2 with 50% misses. Surely that would benefit from prefetching a bit more?

* It mostly ignores effective_io_concurrency, which I think about as "keep this number of I/Os in the queue". But we don't really try to do that.

I understand the current heuristic is trying not to prefetch for cached data sets, but does that actually make sense? With fadvise it made sense, because the prefetched data could get evicted if we prefetched too far ahead. But with worker/io_uring the buffers get pinned, so this shouldn't happen. Of course, that doesn't mean we should prefetch too far ahead - there are LIMIT queries, the limit on buffer pins, etc.

What if the distance heuristic instead asked this question:

    How far do we need to look ahead to generate effective_io_concurrency IOs?

The attached patch is a PoC implementing this. The core idea is that if we measure the "miss probability" for a chunk of requests, we can use that to estimate the distance needed to generate e_i_c IOs. So while the current heuristic looks at individual hits/misses, the patch looks at groups of requests.

The other idea is that the patch maintains a "distance range", with min/max of allowed distances. The min/max values gradually grow after a miss; the "min" value "stops" at max_ios, while "max" grows further. This ensures a gradual ramp-up, helping LIMIT queries etc. And even if there are a lot of hits, the distance is not allowed to drop below the current "min". Because what would be the benefit of that?

- If the read is a hit, we might read it later - but the cost is about the same, we're not really saving much by delaying the read.

- If the read is a miss, it's clearly better to issue the I/O sooner. This may not be true if it's a LIMIT query, and it terminates early. But if the distance_min is not too high, this should be negligible.
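To make the two behaviors concrete, here is a tiny standalone simulation (a toy model I'm sketching here, not actual read_stream code; effective_io_concurrency = 16 and a pin limit of 256 are just assumed values). The first loop applies the current heuristic to a strict hit-miss alternation starting at distance=1, the second computes the lookahead the patch aims for at various miss probabilities:

#include <stdio.h>

#define MAX_IOS             16      /* effective_io_concurrency (assumed) */
#define MAX_PINNED_BUFFERS  256     /* pin limit (assumed) */

int
main(void)
{
    int     distance = 1;

    /* Part 1: current heuristic (double on miss, decrement on hit) */
    for (int i = 0; i < 16; i++)
    {
        if (i % 2 == 0)             /* miss: ramp up by doubling */
            distance = (2 * distance < MAX_PINNED_BUFFERS) ?
                2 * distance : MAX_PINNED_BUFFERS;
        else if (distance > 1)      /* hit: decay by decrementing */
            distance--;

        printf("request %2d: distance = %d\n", i + 1, distance);
    }

    /* Part 2: distance needed to produce MAX_IOS misses at a given miss rate */
    for (int miss_pct = 10; miss_pct <= 90; miss_pct += 20)
    {
        double  miss_prob = miss_pct / 100.0;
        int     target = (int) (MAX_IOS / miss_prob + 0.5);

        if (target > MAX_PINNED_BUFFERS)
            target = MAX_PINNED_BUFFERS;

        printf("miss probability %d%% -> target distance %d\n",
               miss_pct, target);
    }

    return 0;
}

The first loop never escapes the 1 <-> 2 cycle, while the second one suggests e.g. distance 32 for a 50% miss rate - which is the gap the patch is trying to close.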
Attached is an example table/query, found by my script. Without the read_stream patch (i.e. just with the current index prefetching), it looks like this:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Prefetch Distance: 2.032
   Prefetch Count: 868165
   Prefetch Stalls: 2140228
   Prefetch Skips: 6039906
   Prefetch Resets: 0
   Stream Ungets: 0
   Stream Forwarded: 4
   Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
   Buffers: shared hit=2577599 read=455610
 Planning:
   Buffers: shared hit=78 read=26 dirtied=1
 Planning Time: 1.032 ms
 Execution Time: 3150.578 ms
(16 rows)

and with the attached patch:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Prefetch Distance: 36.321
   Prefetch Count: 3730750
   Prefetch Stalls: 3
   Prefetch Skips: 6039906
   Prefetch Resets: 0
   Stream Ungets: 722353
   Stream Forwarded: 305265
   Prefetch Histogram: [2,4) => 10, [4,8) => 11, [8,16) => 6, [16,32) => 316890, [32,64) => 3413833
   Buffers: shared hit=2574776 read=455610
 Planning:
   Buffers: shared hit=78 read=26
 Planning Time: 2.249 ms
 Execution Time: 1651.826 ms
(16 rows)

The example is not entirely perfect, because the index prefetching does not actually beat master:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Buffers: shared hit=2577599 read=455610
 Planning:
   Buffers: shared hit=78 read=26
 Planning Time: 3.688 ms
 Execution Time: 1656.790 ms
(8 rows)

So it's more a case of "mitigating a regression" (finding regressions like this is the purpose of my script). Still, I believe the questions about the distance heuristics are valid.

(Another interesting detail is that the regression happens only with io_method=worker, not with io_uring. I'm not sure why.)

regards

-- 
Tomas Vondra
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index ed5feac2d39..0ea77dff3fc 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -116,6 +116,15 @@ struct ReadStream
 	int64		forwarded_count;
 	int64		distance_hist[16];
 
+	/* acceptable distance range */
+	int16		distance_min;
+	int16		distance_max;
+
+	/* number of hits / misses */
+	int16		num_reads;
+	int16		num_hits;
+	int16		threshold;
+
 	/*
 	 * One-block buffer to support 'ungetting' a block number, to resolve flow
 	 * control problems when I/Os are split.
@@ -235,6 +244,117 @@ read_stream_get_block(ReadStream *stream, void *per_buffer_data)
 	return blocknum;
 }
 
+/*
+ * read_stream_adjust_distance
+ *		adjust distance based on the number of hits/misses
+ *
+ * This adjusts three parameters used to pick the distance, based on hits and
+ * misses when reading the buffers:
+ *
+ * - distance - the "actual distance"
+ *
+ * - distance_min and distance_max - this restricts the acceptable range
+ *   for the "actual distance"
+ *
+ * The range is adjusted with every miss. It starts with [1,1], and every miss
+ * increases the limits up to [max_ios, PG_INT16_MAX]. The "actual distance" is
+ * kept within the range. This means the distance "ramps up" gradually, and does
+ * not drop all the way back to 1.
+ *
+ * The "actual distance" is adjusted less frequently, after seeing a chunk of
+ * requests. We calculate the "probability of a miss" and use it to estimate how
+ * many requests to look ahead to keep "max_ios" in the queue. The calculated
+ * distance is still kept in the min/max range.
+ */
+static inline void
+read_stream_adjust_distance(ReadStream *stream, bool miss)
+{
+	/*
+	 * Count hits/misses and (maybe) widen the distance range.
+	 *
+	 * XXX Maybe the range should be adjusted always, not just for a miss?
+	 *
+	 * XXX The min distance is capped to max_ios, because that's the maximum
+	 * number we know we can handle for 100% miss rate.
+	 */
+	if (miss)
+	{
+		stream->num_reads++;
+		stream->distance_min = Min(stream->distance_min * 2,
+								   stream->max_ios);
+		stream->distance_max = Min(stream->distance_max * 2,
+								   PG_INT16_MAX);
+	}
+	else
+	{
+		stream->num_hits++;
+	}
+
+	/*
+	 * Adjust the actual distance, based on miss ratio.
+	 *
+	 * We only do this once in a while, after seeing "threshold" requests, so
+	 * that we have somewhat accurate estimate of miss ratio. We can still see
+	 * miss_prob=0.0, so be careful about it.
+	 */
+	if ((stream->num_hits + stream->num_reads) >= stream->threshold)
+	{
+		/*
+		 * If we saw any misses, estimate how far to look ahead to see max_ios
+		 * I/Os (which considers effective_io_concurrency, our goal).
+		 */
+		if (stream->num_reads > 0)
+		{
+			/* probability of a miss */
+			double		miss_prob
+				= stream->num_reads * 1.0 / (stream->num_hits + stream->num_reads);
+
+			/* number of requests to get max_ios misses */
+			stream->distance = Min(stream->max_ios / miss_prob,
+								   PG_INT16_MAX);
+
+			stream->distance = Min(stream->distance, stream->max_pinned_buffers);
+		}
+		else
+		{
+			/*
+			 * With no misses, we simply use the current minimal distance.
+			 *
+			 * XXX Maybe we should use the maximum instead?
+			 */
+			stream->distance = stream->distance_min;
+		}
+
+		/* reset the counters, to start a new interval */
+		stream->num_hits = 0;
+		stream->num_reads = 0;
+
+		/*
+		 * When to re-calculate the distance? Not too often, to get a good
+		 * estimate of miss probability (we need to see enough requests), but
+		 * also not too infrequently (we want this to be adaptive).
+		 *
+		 * XXX Seems reasonable to base this on the distance. It means we
+		 * expect to see max_ios misses, because that's how we calculated the
+		 * distance.
+		 *
+		 * XXX But don't do this too infrequently. The distance can get quite
+		 * high, so cap to 10x the I/Os. Arbitrary value, maybe needs more
+		 * thought.
+		 */
+		stream->threshold = Max(stream->distance, stream->max_ios * 10);
+	}
+
+	/*
+	 * in any case, respect distance_min, distance_max
+	 *
+	 * XXX This means we actually adjust the distance after every miss, not just
+	 * after every stream->threshold requests. Is this a good idea?
+	 */
+	stream->distance = Max(stream->distance, stream->distance_min);
+	stream->distance = Min(stream->distance, stream->distance_max);
+}
+
 /*
  * In order to deal with buffer shortages and I/O limits after short reads, we
  * sometimes need to defer handling of a block we've already consumed from the
@@ -398,8 +518,7 @@ read_stream_start_pending_read(ReadStream *stream)
 	if (!need_wait)
 	{
 		/* Look-ahead distance decays, no I/O necessary. */
-		if (stream->distance > 1)
-			stream->distance--;
+		read_stream_adjust_distance(stream, false);
 	}
 	else
 	{
@@ -758,6 +877,13 @@ read_stream_begin_impl(int flags,
 	else
 		stream->distance = 1;
 
+	stream->num_hits = 0;
+	stream->num_reads = 0;
+
+	stream->distance_min = 1;
+	stream->distance_max = 1;
+	stream->threshold = stream->max_ios;	/* XXX rethink? */
+
 	/*
 	 * Since we always access the same relation, we can initialize parts of
 	 * the ReadBuffersOperation objects and leave them that way, to avoid
@@ -971,7 +1097,6 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
 	{
 		int16		io_index = stream->oldest_io_index;
-		int32		distance;	/* wider temporary value, clamped below */
 
 		/* Sanity check that we still agree on the buffers. */
 		Assert(stream->ios[io_index].op.buffers ==
@@ -985,9 +1110,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->oldest_io_index = 0;
 
 		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
-		distance = Min(distance, stream->max_pinned_buffers);
-		stream->distance = distance;
+		read_stream_adjust_distance(stream, true);
 
 		/*
 		 * If we've reached the first block of a sequential region we're
@@ -1146,6 +1269,9 @@ read_stream_reset(ReadStream *stream)
 	stream->distance = Max(1, stream->distance_old);
 	stream->distance_old = 0;
 
+	stream->distance_min = stream->distance;
+	stream->distance_max = stream->distance;
+
 	/* track the number of resets */
 	stream->reset_count += 1;
 }
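To make the ramp-up easier to follow, here's a rough standalone trace of how the allowed range widens in the PoC (toy code mirroring the widening logic in read_stream_adjust_distance() above, with max_ios = 16 assumed; it's not part of the patch):

#include <stdio.h>

#define MAX_IOS         16      /* effective_io_concurrency (assumed) */
#define PG_INT16_MAX    0x7FFF

int
main(void)
{
    int     distance_min = 1;
    int     distance_max = 1;

    /* every miss widens the range, with the minimum capped at MAX_IOS */
    for (int miss = 1; miss <= 10; miss++)
    {
        distance_min = (distance_min * 2 < MAX_IOS) ? distance_min * 2 : MAX_IOS;
        distance_max = (distance_max * 2 < PG_INT16_MAX) ? distance_max * 2 : PG_INT16_MAX;

        printf("after miss %2d: allowed distance range [%d, %d]\n",
               miss, distance_min, distance_max);
    }

    return 0;
}

So after a few misses the minimum settles at max_ios while the maximum keeps doubling, and the actual distance - recomputed every "threshold" requests as max_ios / miss_prob - is then clamped into that range.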
Attachment: test.sql (application/sql)