On 8/20/25 00:27, Peter Geoghegan wrote:
> On Tue, Aug 19, 2025 at 2:22 PM Peter Geoghegan <p...@bowt.ie> wrote:
>> That definitely seems like a problem. I think that you're saying that
>> this problem happens because we have extra buffer hits earlier on,
>> which is enough to completely change the ramp-up behavior. This seems
>> to be all it takes to dramatically decrease the effectiveness of
>> prefetching. Does that summary sound correct?
> 

That summary is correct, yes. I kept thinking about this, while looking
at more regressions found by my script (that generates data sets with
different data distributions, etc.).

Almost all regressions (at least the top ones) now look like this, i.e.
distance collapses to ~2.0, which essentially disables prefetching.

But I no longer think it's caused by the "priorbatch" optimization,
which delays read stream creation until after the first batch. I still
think we may need to rethink that (e.g. if the first batch is huge), but
the distance can "collapse" even without it. The optimization just makes
it easier to happen.

AFAICS the distance collapse is "inherent" to how the distance gets
increased/decreased after hits/misses.

Let's start with distance=1, and let's assume 50% of buffers are hits,
in a regular pattern - hit-miss-hit-miss-hit-miss-...

In this case, the distance will never increase beyond 2, because we'll
double-decrement-double-decrement-... so it'll flip between 1 and 2, no
matter how you set effective_io_concurrency.

Of course, this can happen even with other hit ratios, there's nothing
special about 50%.

With fewer hits it's fine - there's an asymmetry, because the distance
grows by doubling but decreases only by decrements of 1. So once we have
a bit more misses, it keeps growing.

But with more hits, the hit/miss ratio simply determines the "stable"
distance. Let's say there's 80% hits, so 4 hits to 1 miss. Then the
stable distance is ~4, because we get a miss, double to 8, and then 4
hits, so the distance drops back to 4. And again.

Similarly for other hit/miss ratios (it's easier to think about if you
keep the number of hits 2^n).
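
To make this concrete, here's a tiny standalone simulation of the current
heuristics - double after a miss, decrement after a hit. (Sketch only, not
part of the patch; the cap of 256 is an arbitrary stand-in for the pin
limit.)

#include <stdbool.h>
#include <stdio.h>

/* one step of the current distance adjustment */
static int
adjust(int distance, bool miss, int cap)
{
	if (miss)
		return (2 * distance < cap) ? 2 * distance : cap;

	return (distance > 1) ? distance - 1 : 1;
}

int
main(void)
{
	int		distance = 1;

	/* 50% hits (hit-miss-hit-miss-...): flips between 1 and 2 forever */
	for (int i = 0; i < 8; i++)
	{
		distance = adjust(distance, (i % 2 == 0), 256);
		printf("50%% hits: distance = %d\n", distance);
	}

	/* 80% hits (1 miss, then 4 hits), starting at the stable distance 4:
	 * oscillates between 8 and 4, never ramps up further */
	distance = 4;
	for (int i = 0; i < 10; i++)
	{
		distance = adjust(distance, (i % 5 == 0), 256);
		printf("80%% hits: distance = %d\n", distance);
	}

	return 0;
}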

It's worth noticing that effective_io_concurrency has almost no impact on
the distance we end up with; it merely limits the maximum distance.

I find this distance heuristic a bit strange, for a couple of reasons:

* It doesn't seem right to get stuck at distance=2 with 50% misses.
Surely that would benefit from prefetching a bit more?

* It mostly ignores effective_io_concurrency, which I think of as "keep
this number of I/Os in the queue". But we don't actually try to do that.

I understand the current heuristic is trying not to prefetch for cached
data sets, but does that actually make sense? With fadvise it made sense,
because the prefetched data could get evicted if we prefetched too far
ahead. But with worker/io_uring the buffers get pinned, so this shouldn't
happen. Of course, that doesn't mean we should prefetch arbitrarily far
ahead - there are LIMIT queries, the limit on buffer pins, etc.

What about if the distance heuristics asks this question:

  How far do we need to look to generate effective_io_concurrency IOs?

The attached patch is a PoC implementing this. The core idea is that if
we measure the "miss probability" over a chunk of requests, we can use it
to estimate the distance needed to generate effective_io_concurrency IOs.

So while the current heuristic looks at individual hits/misses, the patch
looks at groups of requests.
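
For example (numbers purely for illustration): with effective_io_concurrency
= 16 and a measured miss probability of 0.25, the estimated distance is
16 / 0.25 = 64, i.e. we need to look about 64 requests ahead to expect
~16 of them to be misses.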

The other idea is that the patch maintains a "distance range" with min/max
of allowed distances. The min/max values gradually grow after a miss; the
"min" value stops at max_ios, while the "max" grows further.

This ensures gradual ramp up, helping LIMIT queries etc.
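
For example, assuming max_ios ends up as 16 (based on
effective_io_concurrency), the range grows with each miss as [1,1] ->
[2,2] -> [4,4] -> [8,8] -> [16,16] -> [16,32] -> [16,64] -> ..., with the
max eventually capped at PG_INT16_MAX.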

And even if there are a lot of hits, the distance is not allowed to drop
below the current "min". Because what would be the benefit of that?

- If the read is a hit, we could have read it later - but the cost would
be about the same, so we don't really save anything by delaying the read.

- If the read is a miss, it's clearly better to issue the I/O sooner.

This may not be true if it's a LIMIT query that terminates early. But if
distance_min is not too high, this should be negligible.

Attached is an example table/query, found by my script. Without the
read_stream patch (i.e. just with the current index prefetching), it
looks like this:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Prefetch Distance: 2.032
     Prefetch Count: 868165
     Prefetch Stalls: 2140228
     Prefetch Skips: 6039906
     Prefetch Resets: 0
     Stream Ungets: 0
     Stream Forwarded: 4
     Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
     Buffers: shared hit=2577599 read=455610
   Planning:
     Buffers: shared hit=78 read=26 dirtied=1
   Planning Time: 1.032 ms
   Execution Time: 3150.578 ms
  (16 rows)

and with the attached patch:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t  (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Prefetch Distance: 36.321
     Prefetch Count: 3730750
     Prefetch Stalls: 3
     Prefetch Skips: 6039906
     Prefetch Resets: 0
     Stream Ungets: 722353
     Stream Forwarded: 305265
     Prefetch Histogram: [2,4) => 10, [4,8) => 11, [8,16) => 6,
                         [16,32) => 316890, [32,64) => 3413833
     Buffers: shared hit=2574776 read=455610
   Planning:
     Buffers: shared hit=78 read=26
   Planning Time: 2.249 ms
   Execution Time: 1651.826 ms
  (16 rows)

The example is not entirely perfect, because the index prefetching does
not actually beat master:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t   (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Buffers: shared hit=2577599 read=455610
   Planning:
     Buffers: shared hit=78 read=26
   Planning Time: 3.688 ms
   Execution Time: 1656.790 ms
  (8 rows)

So it's more a case of "mitigating a regression" (finding regressions
like this is the purpose of my script). Still, I believe the questions
about the distance heuristics are valid.

(Another interesting detail is that the regression happens only with
io_method=worker, not with io_uring. I'm not sure why.)


regards

-- 
Tomas Vondra
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index ed5feac2d39..0ea77dff3fc 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -116,6 +116,15 @@ struct ReadStream
 	int64		forwarded_count;
 	int64		distance_hist[16];
 
+	/* acceptable distance range */
+	int16		distance_min;
+	int16		distance_max;
+
+	/* number of hits / misses */
+	int16		num_reads;
+	int16		num_hits;
+	int16		threshold;
+
 	/*
 	 * One-block buffer to support 'ungetting' a block number, to resolve flow
 	 * control problems when I/Os are split.
@@ -235,6 +244,117 @@ read_stream_get_block(ReadStream *stream, void *per_buffer_data)
 	return blocknum;
 }
 
+/*
+ * read_stream_adjust_distance
+ *		adjust distance based on the number of hits/misses
+ *
+ * This adjusts three parameters used to pick the distance, based on hits and
+ * misses when reading the buffers:
+ *
+ * - distance - the "actual distance"
+ *
+ * - distance_min and distance_max - these restrict the acceptable range
+ *   for the "actual distance"
+ *
+ * The range is adjusted with every miss. It starts with [1,1], and every miss
+ * increases the limits up to [max_ios, PG_INT16_MAX]. The "actual distance" is
+ * kept within the range. This means the distance "ramps up" gradually, and does
+ * not drop all the way back to 1.
+ *
+ * The "actual distance" is adjusted less frequently, after seeing a chunk of
+ * requests. We calculate the "probability of a miss" and use it to estimate how
+ * many requests to look ahead to keep "max_ios" in the queue. The calculated
+ * distance is still kept in the min/max range.
+ */
+static inline void
+read_stream_adjust_distance(ReadStream *stream, bool miss)
+{
+	/*
+	 * Count hits/misses and (maybe) widen the distance range.
+	 *
+	 * XXX Maybe the range should be adjusted always, not just for a miss?
+	 *
+	 * XXX The min distance is capped to max_ios, because that's the maximum
+	 * number we know we can handle for 100% miss rate.
+	 */
+	if (miss)
+	{
+		stream->num_reads++;
+		stream->distance_min = Min(stream->distance_min * 2,
+								   stream->max_ios);
+		stream->distance_max = Min(stream->distance_max * 2,
+								   PG_INT16_MAX);
+	}
+	else
+	{
+		stream->num_hits++;
+	}
+
+	/*
+	 * Adjust the actual distance, based on miss ratio.
+	 *
+	 * We only do this once in a while, after seeing "threshold" requests, so
+	 * that we have a somewhat accurate estimate of the miss ratio. We can
+	 * still see miss_prob=0.0, so be careful about it.
+	 */
+	if ((stream->num_hits + stream->num_reads) >= stream->threshold)
+	{
+		/*
+		 * If we saw any misses, estimate how far to look ahead to see max_ios
+		 * I/Os (which considers effective_io_concurrency, our goal).
+		 */
+		if (stream->num_reads > 0)
+		{
+			/* probability of a miss */
+			double	miss_prob
+				= stream->num_reads * 1.0 / (stream->num_hits + stream->num_reads);
+
+			/* number of requests to get max_ios misses */
+			stream->distance = Min(stream->max_ios / miss_prob,
+								   PG_INT16_MAX);
+
+			stream->distance = Min(stream->distance, stream->max_pinned_buffers);
+		}
+		else
+		{
+			/*
+			 * With no misses, we simply use the current minimal distance.
+			 *
+			 * XXX Maybe we should use the maximum instead?
+			 */
+			stream->distance = stream->distance_min;
+		}
+
+		/* reset the counters, to start a new interval */
+		stream->num_hits = 0;
+		stream->num_reads = 0;
+
+		/*
+		 * When to re-calculate the distance? Not too often, to get a good
+		 * estimate of the miss probability (we need to see enough requests),
+		 * but also not too infrequently (we want this to be adaptive).
+		 *
+		 * XXX Seems reasonable to base this on the distance. It means we
+		 * expect to see max_ios misses, because that's how we calculated the
+		 * distance.
+		 *
+		 * XXX But don't do this too infrequently. The distance can get quite
+		 * high, so cap to 10x the I/Os. Arbitrary value, maybe needs more
+		 * thought.
+		 */
+		stream->threshold = Min(stream->distance, stream->max_ios * 10);
+	}
+
+	/*
+	 * in any case, respect distance_min, distance_max
+	 *
+	 * XXX This means we actually adjust the distance after every miss, not just
+	 * after every stream->threshold requests. Is this a good idea?
+	 */
+	stream->distance = Max(stream->distance, stream->distance_min);
+	stream->distance = Min(stream->distance, stream->distance_max);
+}
+
 /*
  * In order to deal with buffer shortages and I/O limits after short reads, we
  * sometimes need to defer handling of a block we've already consumed from the
@@ -398,8 +518,7 @@ read_stream_start_pending_read(ReadStream *stream)
 	if (!need_wait)
 	{
 		/* Look-ahead distance decays, no I/O necessary. */
-		if (stream->distance > 1)
-			stream->distance--;
+		read_stream_adjust_distance(stream, false);
 	}
 	else
 	{
@@ -758,6 +877,13 @@ read_stream_begin_impl(int flags,
 	else
 		stream->distance = 1;
 
+	stream->num_hits = 0;
+	stream->num_reads = 0;
+
+	stream->distance_min = 1;
+	stream->distance_max = 1;
+	stream->threshold = stream->max_ios;	/* XXX rethink? */
+
 	/*
 	 * Since we always access the same relation, we can initialize parts of
 	 * the ReadBuffersOperation objects and leave them that way, to avoid
@@ -971,7 +1097,6 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
 	{
 		int16		io_index = stream->oldest_io_index;
-		int32		distance;	/* wider temporary value, clamped below */
 
 		/* Sanity check that we still agree on the buffers. */
 		Assert(stream->ios[io_index].op.buffers ==
@@ -985,9 +1110,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->oldest_io_index = 0;
 
 		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
-		distance = Min(distance, stream->max_pinned_buffers);
-		stream->distance = distance;
+		read_stream_adjust_distance(stream, true);
 
 		/*
 		 * If we've reached the first block of a sequential region we're
@@ -1146,6 +1269,9 @@ read_stream_reset(ReadStream *stream)
 	stream->distance = Max(1, stream->distance_old);
 	stream->distance_old = 0;
 
+	stream->distance_min = stream->distance;
+	stream->distance_max = stream->distance;
+
 	/* track the number of resets */
 	stream->reset_count += 1;
 }

Attachment: test.sql