Hi, I was wondering why the "simple" approach performs so much worse than the "complex" one on some of the data sets. The theory was that it's due to using read_stream_reset(), which resets the prefetch distance, and so we need to "ramp up" from scratch (distance=1) for every batch. Which for the correlated data sets is very often.
So I decided to do some experiments, to see if this is really the case, and maybe see if read_stream_reset() could fix this in some way. First, I added an elog(LOG, "distance %d", stream->distance); at the beginning of read_stream_next_block() to see how the distance changes during the scan. Consider a query returning 2M rows from the "cyclic" table (the attached .sql creates/pupulates it): -- selects 20% rows SELECT * FROM cyclic WHERE a BETWEEN 0 AND 20000; With the "complex" patch, the CDF of the distance looks like this: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 25 | 0 | | 50 | 0 | | 75 | 0 | | 100 | 0 | | 125 | 0 | | 150 | 0 | | 175 | 0 | | 200 | 0 | | 225 | 0 | | 250 | 0 | | 275 | 99 | | 300 | 99 | +----------+-----+ That is, 99% of the distances is in the range [275, 300]. Note: This is much higher than the effective_io_concurrency value (16), which may be surprising. But the ReadStream uses that to limit the number of I/O requests, not as a limit of how far to look ahead. A lot of the blocks are in the cache, so it looks far ahead. But with the "simple" patch it looks like this: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 25 | 99 | | 50 | 99 | | 75 | 99 | | 100 | 99 | | 125 | 99 | | 150 | 99 | | 175 | 99 | | 200 | 99 | | 225 | 99 | | 250 | 99 | | 275 | 100 | | 300 | 100 | +----------+-----+ So 99% of the distances is in [0, 25]. A more detailed view on the first couple distances: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 1 | 99 | | 2 | 99 | | 3 | 99 | | 4 | 99 | ... So 99% of distances is 1. Well, that's not very far, it effectively means no prefetching (We still issue the fadvise, though, although a comment in read_stream.c suggests we won't. Possible bug?). This means *there's no ramp-up at all*. On the first leaf the distance grows to ~270, but after the stream gets reset it stays at 1 and never increases. That's ... not great? I'm not entirely sure I decided to hack the ReadStream a bit, so that it restores the last non-zero distance seen (i.e. right before reaching end of the stream). And with that I got this: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 25 | 38 | | 50 | 38 | | 75 | 38 | | 100 | 39 | | 125 | 42 | | 150 | 47 | | 175 | 47 | | 200 | 48 | | 225 | 49 | | 250 | 50 | | 275 | 100 | | 300 | 100 | +----------+-----+ Not as good as the "complex" patch, but much better than the original. And the performance got almost the same (for this one query). Perhaps the ReadStream should do something like this? Of course, the simple patch resets the stream very often, likely mcuh more often than anything else in the code. But wouldn't it be beneficial for streams reset because of a rescan? Possibly needs to be optional. regards -- Tomas Vondra
From 817d5a7f7d19d0d9fe75ba1bb00a8164a438021a Mon Sep 17 00:00:00 2001 From: Tomas Vondra <to...@vondra.me> Date: Fri, 18 Jul 2025 16:41:03 +0200 Subject: [PATCH] read_stream: restore prefetch distance after reset --- src/backend/storage/aio/read_stream.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 0e7f5557f5c..796791f8427 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -99,6 +99,7 @@ struct ReadStream int16 forwarded_buffers; int16 pinned_buffers; int16 distance; + int16 distance_old; int16 initialized_buffers; int read_buffers_flags; bool sync_mode; /* using io_method=sync */ @@ -443,6 +444,7 @@ read_stream_look_ahead(ReadStream *stream) if (blocknum == InvalidBlockNumber) { /* End of stream. */ + stream->distance_old = stream->distance; stream->distance = 0; break; } @@ -841,6 +843,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) else { /* No more blocks, end of stream. */ + stream->distance_old = stream->distance; stream->distance = 0; stream->oldest_buffer_index = stream->next_buffer_index; stream->pinned_buffers = 0; @@ -1012,6 +1015,9 @@ read_stream_reset(ReadStream *stream) int16 index; Buffer buffer; + /* remember the old distance (if we reset before end of the stream) */ + stream->distance_old = Max(stream->distance, stream->distance_old); + /* Stop looking ahead. */ stream->distance = 0; @@ -1044,8 +1050,12 @@ read_stream_reset(ReadStream *stream) Assert(stream->pinned_buffers == 0); Assert(stream->ios_in_progress == 0); - /* Start off assuming data is cached. */ - stream->distance = 1; + /* + * Restore the old distance, if we have one. Otherwise start assuming + * data is cached. + */ + stream->distance = Max(1, stream->distance_old); + stream->distance_old = 0; } /* -- 2.50.1
cyclic.sql
Description: application/sql