Hi,

I was wondering why the "simple" approach performs so much worse than
the "complex" one on some of the data sets. The theory was that it's
due to read_stream_reset(), which resets the prefetch distance, so we
have to "ramp up" from scratch (distance=1) for every batch, and for
the correlated data sets that happens very often.

So I decided to do some experiments, to see if this is really the case,
and maybe see if read_stream_reset() could fix this in some way.

First, I added an

    elog(LOG, "distance %d", stream->distance);

at the beginning of read_stream_next_buffer() to see how the distance
changes during the scan. Consider a query returning 2M rows from the
"cyclic" table (the attached .sql creates/populates it):

   -- selects 20% of the rows
   SELECT * FROM cyclic WHERE a BETWEEN 0 AND 20000;

With the "complex" patch, the CDF of the distance looks like this:

+----------+-----+
| distance | pct |
+----------+-----+
| 0        | 0   |
| 25       | 0   |
| 50       | 0   |
| 75       | 0   |
| 100      | 0   |
| 125      | 0   |
| 150      | 0   |
| 175      | 0   |
| 200      | 0   |
| 225      | 0   |
| 250      | 0   |
| 275      | 99  |
| 300      | 99  |
+----------+-----+

That is, 99% of the distances are in the range [275, 300].

Note: This is much higher than the effective_io_concurrency value (16),
which may be surprising. But the ReadStream uses that value to limit the
number of concurrent I/O requests, not as a limit on how far to look
ahead. Many of the blocks are already cached, so the stream can look far
ahead.
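To make this concrete, here's a toy model of the distance heuristic as
I understand it from the comments in read_stream.c. It's a sketch, not
the actual code (the function name and the cap value are mine): a block
that needs I/O doubles the distance, a cached block decays it towards 1,
and the cap comes from the buffer pin limit, not
effective_io_concurrency.

```python
# Toy model of the look-ahead distance heuristic (my simplification,
# not the real read_stream.c logic): double on I/O, decay on cache hit,
# capped at ~300 to roughly match the plateau in the CDF above.

MAX_DISTANCE = 300

def next_distance(distance, needs_io):
    """One step of the (simplified) distance heuristic."""
    if needs_io:
        return min(distance * 2, MAX_DISTANCE)  # ramp up quickly
    return max(1, distance - 1)                 # back off slowly

# Cold cache: with consecutive misses the distance hits the cap fast.
ramped = 1
for _ in range(9):
    ramped = next_distance(ramped, needs_io=True)

# Fully cached: the distance never leaves 1, i.e. no look-ahead.
cached = 1
for _ in range(100):
    cached = next_distance(cached, needs_io=False)
```

With a mix of hits and misses the distance settles somewhere between
those two extremes, which would explain the 275-300 plateau above.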

But with the "simple" patch it looks like this:

+----------+-----+
| distance | pct |
+----------+-----+
| 0        | 0   |
| 25       | 99  |
| 50       | 99  |
| 75       | 99  |
| 100      | 99  |
| 125      | 99  |
| 150      | 99  |
| 175      | 99  |
| 200      | 99  |
| 225      | 99  |
| 250      | 99  |
| 275      | 100 |
| 300      | 100 |
+----------+-----+

So 99% of the distances are in [0, 25]. A more detailed view of the
first few distances:

+----------+-----+
| distance | pct |
+----------+-----+
| 0        | 0   |
| 1        | 99  |
| 2        | 99  |
| 3        | 99  |
| 4        | 99  |
...

So 99% of the distances are exactly 1. That's not very far; it
effectively means no prefetching. (We still issue the fadvise, though,
although a comment in read_stream.c suggests we won't. Possible bug?)

This means *there's no ramp-up at all*. On the first leaf the distance
grows to ~270, but after the stream gets reset it stays at 1 and never
increases. That's ... not great?

I'm not entirely sure why it behaves this way.

I decided to hack the ReadStream a bit, so that it restores the last
non-zero distance seen (i.e. the distance right before reaching the end
of the stream).
And with that I got this:

+----------+-----+
| distance | pct |
+----------+-----+
| 0        | 0   |
| 25       | 38  |
| 50       | 38  |
| 75       | 38  |
| 100      | 39  |
| 125      | 42  |
| 150      | 47  |
| 175      | 47  |
| 200      | 48  |
| 225      | 49  |
| 250      | 50  |
| 275      | 100 |
| 300      | 100 |
+----------+-----+

Not as good as the "complex" patch, but much better than the original.
And the performance became almost the same (at least for this one
query).
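To see why restoring the distance makes such a difference, here's a
self-contained toy model of the two reset policies. All of this is my
simplification, not the real read_stream.c logic: misses double the
distance, cache hits decay it by one, the cap is 300, the first batch
is cold (all misses), and later batches hit the cache 4 times out of 5.

```python
# Toy comparison of "reset to 1" vs. "restore the saved distance"
# (assumptions: double on miss, decay by 1 on hit, cap 300, cold first
# batch, 20% misses afterwards -- not the actual read_stream.c code).

CAP = 300

def scan(n_batches, batch_len, restore):
    distance, saved, seen, i = 1, 1, [], 0
    for _ in range(n_batches):
        for _ in range(batch_len):
            seen.append(distance)        # distance used for this block
            miss = i < batch_len or i % 5 == 0
            if miss:
                distance = min(distance * 2, CAP)
            else:
                distance = max(1, distance - 1)
            i += 1
        # read_stream_reset(): either forget the distance, or (like the
        # attached patch) remember the largest one seen and restore it.
        saved = max(saved, distance)
        distance = saved if restore else 1
    return seen

plain = scan(100, 50, restore=False)
restored = scan(100, 50, restore=True)
```

In this model the plain reset never recovers: starting from distance 1,
a single miss followed by four hits keeps the distance pinned at 1-2
forever, matching the measured CDF. With the restore, the stream starts
each batch near the cap and the occasional misses sustain it there.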

Perhaps the ReadStream should do something like this? Of course, the
simple patch resets the stream very often, likely much more often than
anything else in the code. But wouldn't this be beneficial for streams
reset because of a rescan? It possibly needs to be optional.


regards

-- 
Tomas Vondra
From 817d5a7f7d19d0d9fe75ba1bb00a8164a438021a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 18 Jul 2025 16:41:03 +0200
Subject: [PATCH] read_stream: restore prefetch distance after reset

---
 src/backend/storage/aio/read_stream.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 0e7f5557f5c..796791f8427 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -99,6 +99,7 @@ struct ReadStream
 	int16		forwarded_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	int16		distance_old;
 	int16		initialized_buffers;
 	int			read_buffers_flags;
 	bool		sync_mode;		/* using io_method=sync */
@@ -443,6 +444,7 @@ read_stream_look_ahead(ReadStream *stream)
 		if (blocknum == InvalidBlockNumber)
 		{
 			/* End of stream. */
+			stream->distance_old = stream->distance;
 			stream->distance = 0;
 			break;
 		}
@@ -841,6 +843,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No more blocks, end of stream. */
+			stream->distance_old = stream->distance;
 			stream->distance = 0;
 			stream->oldest_buffer_index = stream->next_buffer_index;
 			stream->pinned_buffers = 0;
@@ -1012,6 +1015,9 @@ read_stream_reset(ReadStream *stream)
 	int16		index;
 	Buffer		buffer;
 
+	/* remember the old distance (if we reset before end of the stream) */
+	stream->distance_old = Max(stream->distance, stream->distance_old);
+
 	/* Stop looking ahead. */
 	stream->distance = 0;
 
@@ -1044,8 +1050,12 @@ read_stream_reset(ReadStream *stream)
 	Assert(stream->pinned_buffers == 0);
 	Assert(stream->ios_in_progress == 0);
 
-	/* Start off assuming data is cached. */
-	stream->distance = 1;
+	/*
+	 * Restore the old distance, if we have one. Otherwise start assuming
+	 * data is cached.
+	 */
+	stream->distance = Max(1, stream->distance_old);
+	stream->distance_old = 0;
 }
 
 /*
-- 
2.50.1

Attachment: cyclic.sql
Description: application/sql
