AIO / read stream heuristics adjustments for index prefetching

Andres Freund Tue, 31 Mar 2026 09:05:38 -0700

Hi,

The index prefetching patchset [1] contains a few adjustments to the read
stream logic for readahead.  It seemed better to discuss them separately than
in that already very large thread.

The first two patches are also a dependency of the explain read stream
patches [2].

There are two main areas that prefetching of table data during an index scan
is more sensitive to than existing read stream users:

1) Prefetching for index scans is much more sensitive to doing too aggressive
read ahead due to plans that involve running index scans to partial
completion, rather than full completion. Consider e.g. nestloop antijoins
or such, where the scan on the inner side will be started but often not
completed. If we unnecessarily read ahead too aggressively, a lot of IO
could be wasted.

While it's of course possible to have partially consumed read streams with
sequential scans or bitmap heap scans, it's not as common / cost sensitive.

For seqscans it likely mostly happens with a LIMIT above the seqscan, but
that probably won't be happening many times within a query on a table of
any size.

For bitmap heap scans it's not as common because the startup cost, i.e. the
building of the bitmap, is far from cheap, doing that over and over does
not make a lot of sense.

2) Prefetching for index scans is much more likely to have complicated mixes
of hits and misses.

Whereas a seqscan or a bitmap heap scan accesses each table block exactly
once, with index scans its very common to have repeated accesses to some
table blocks, while still having misses on other blocks. This means that
index scans are more sensitive to patterns of hits and misses decreasing
the readahead distance so much that we don't do aggressive enough readahead
to avoid waiting for IO anymore.

While more pronounced with index prefetching, it was already an issue with
the existing users, particularly for bitmap heap scans. In fact, a similar
patch to what's included here was first discussed somewhere around the BHS
prefetching work.

There's a few sets of changes here:

0001+0002: Return whether WaitReadBuffers() needed to wait

The first patch allows pgaio_wref_check_done() to work more reliably with
io_uring. Until now it only was able to return true if userspace already
had consumed the kernel's completion event, but returned false otherwise.
That's not really incorrect, just suboptimal.

The second patch returns whether WaitReadBuffers() needed to wait for
IO. This is useful for a) instrumentation like in [2] and b) to provide
information to the read_stream heuristics to control how aggressive to
perform read ahead.

0003: read_stream: Issue IO synchronously while in fast path

When read stream is in fast path mode (where it short-circuits the read
ahead logic, to reduce CPU overhead in s_b resident workloads) and
encounters a miss, we until now performed the read asynchronously.

Unfortunately, with worker, that can lead to slowdowns, because
dispatching to workers has a latency impact. When doing "real" readahead,
that's a price worth paying, because the latency should be hidden by
issuing the reads early enough. But when just coming out of fast path
mode, we're not ahead of what's needed, so the dispatch latency can't be
hidden.

We already have infrastructure to mark IOs to be executed
synchronously. So we just need to use that here.

0004: read_stream: Prevent distance from decaying too quickly

This, quite simple, patch reduces issue 2) from above, by preventing the
look-ahead distance from being reduced for #maximum lookahead distance
blocks after each miss. While this may seem overly aggressive, a single
effectively synchronous read can take a long time compared to the CPU time
needed for processing pages hits. On cloud storage the IO latency is
somewhere between 0.5ms and 4ms. A halfway modern CPU can do a few
heap_hot_search_buffer()s on 1000s of pages within 1 ms.

While this one is my patch, several others have written variations of it
before. We should probably have committed one already.

There are two minor questions here:
- Should read_stream_pause()/read_stream_resume() restore the "holdoff"
counter? I doubt it matters for the prospective user, since it will
only be used when the lookahead distance is very large.

- For how long to hold off distance reductions? Initially I was torn
between using "max_pinned_buffers" (Min(max_ios * io_combine_limit,
cap)) and "max_ios" ([maintenance_]effective_io_concurrency). But I
think the former makes more sense, as we otherwise won't allow for far
enough readahead when doing IO combining, and it does seem to make sense
to hold off decay for long enough that the maximum lookahead could not
theoretically allow us to start an IO.

0005+0006: Only increase distance when waiting for IO

Until now we have increased the read ahead distance whenever there we
needed to do IO (doubling the distance every miss). But that will often be
way too aggressive, with the IO subsystem being able to keep up with a
much lower distance.

The idea here is to use information about whether we needed to wait for IO
before returning the buffer in read_stream_next_buffer() to control
whether we should increase the readahead distance.

This seems to work extremely well for worker.

Unfortuntely with io_uring the situation is more complicated, because
io_uring performs reads synchronously during submission if the data is the
kernel page cache. This can reduce performance substantially compared to
worker, because it prevents parallelizing the copy from the page cache.
There is an existing heuristic for that in method_io_uring.c that adds a
flag to the IO submissions forcing the IO to be processed asynchronously,
allowing for parallelism. Unfortunately the heuristic is triggered by the
number of IOs in flight - which will never become big enough to tgrigger
after using "needed to wait" to control how far to read ahead.

So 0005 expands the io_uring heuristic to also trigger based on the sizes
of IOs - but that's decidedly not perfect, we e.g. have some experiments
showing it regressing some parallel bitmap heap scan cases. It may be
better to somehow tweak the logic to only trigger for worker.

As is this has another issue, which is that it prevents IO combining in
situations where it shouldn't, because right now using the distance to
control both. See 0008 for an attempt at splitting those concerns.

0007: Make read_stream_reset()/end() not wait for IO

This is a quite experimental, not really correct as-is, patch to avoid
unnecessarily waiting for in-flight IO when read_stream_reset() is done
while there's in-flight IO. This is useful for things like nestloop
antioins with quals on the inner side (without the qual we'd not trigger
any readahead, as that's deferred in the index prefetching patch).

As-is this will leave IOs visible in pg_aios for a while, potentially
until the backends exit. That's not right.

0008: WIP: read stream: Split decision about look ahead for AIO and combining

Until now read stream has used a single look-ahead distance to control
lookahead for both IO combining and read-ahead. That's sub-optimal, as we
want to do IO combining even when we don't need to do any readahead, as
avoiding the syscall overhead is important to reduce CPU overhead when
data is in the kernel page cache.

This is a prototype for what it could look like to split those
decisions. Thereby fixing the regression mentioned in 0006.

One thing that's really annoying around this is that we have no infrastructure
for testing that the heuristics keep working. It's very easy to improve one
thing while breaking something else, without noticing, because everything
keeps working.

I'm wondering about something like a READ_STREAM_DEBUG_INSTRUMENT flag which
would trigger providing information about the IOs and their schedule via the
the per-buffer-data mechanism. That would allow test_aio's
read_stream_for_blocks() to return that information, which in turn could be
used to verify that we are doing IO combining and looking ahead far enough in
some situations.

Greetings,

Andres Freund

[1]
https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
[2] https://postgr.es/m/6f541abf-f9e1-4830-93cc-4a849dbf2ecf%40vondra.me

>From b93842e681b69a9bee67dce5a52c131e56d26301 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 16:40:35 -0500
Subject: [PATCH v3 01/10] aio: io_uring: Allow IO methods to check if IO
 completed in the background

Until now pgaio_wref_check_done() with io_method=io_uring would not detect if
IOs are known to have completed to the kernel, but the completion has not yet
been consumed by userspace.  This can lead to inferior performance and also
makes it harder to use smarter feedback logic in read_stream, because we
cannot use knowledge about whether an IO completed to control the readahead
distance.

This commit just adds the io_uring specific infrastructure. Later commits will
return whether a wait was needed from WaitReadBuffers() and then use that
knowledge.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/aio_internal.h        | 15 ++++++++
 src/backend/storage/aio/aio.c             | 15 ++++++++
 src/backend/storage/aio/method_io_uring.c | 43 +++++++++++++++++++++++
 3 files changed, 73 insertions(+)

diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 5feea15be9e..33e1e2dc048 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -328,6 +328,21 @@ typedef struct IoMethodOps
 	 */
 	void		(*wait_one) (PgAioHandle *ioh,
 							 uint64 ref_generation);
+
+	/* ---
+	 * Check if IO has already completed. Optional.
+	 *
+	 * Some IO methods need to poll a kernel object to see if IO has already
+	 * completed in the background. This callback allows to do so.
+	 *
+	 * This callback may not wait for IO to complete, however it is allowed,
+	 * although not desirable, to wait for short-lived locks. It is ok from a
+	 * correctness perspective to not process any/all available completions,
+	 * it just can lead to inferior performance.
+	 * ---
+	 */
+	void		(*check_one) (PgAioHandle *ioh,
+							  uint64 ref_generation);
 } IoMethodOps;
 
 
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index e4ae3031fef..4e742038d02 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -1019,6 +1019,21 @@ pgaio_wref_check_done(PgAioWaitRef *iow)
 
 	am_owner = ioh->owner_procno == MyProcNumber;
 
+	/*
+	 * If the IO is not executing synchronously, allow the IO method to check
+	 * if the IO already has completed.
+	 */
+	if (pgaio_method_ops->check_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
+	{
+		pgaio_method_ops->check_one(ioh, ref_generation);
+
+		if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+			return true;
+
+		if (state == PGAIO_HS_IDLE)
+			return true;
+	}
+
 	if (state == PGAIO_HS_COMPLETED_SHARED ||
 		state == PGAIO_HS_COMPLETED_LOCAL)
 	{
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index 4867ded35ea..39984df31b4 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -54,6 +54,7 @@ static void pgaio_uring_shmem_init(bool first_time);
 static void pgaio_uring_init_backend(void);
 static int	pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
 static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+static void pgaio_uring_check_one(PgAioHandle *ioh, uint64 ref_generation);
 
 /* helper functions */
 static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
@@ -75,6 +76,7 @@ const IoMethodOps pgaio_uring_ops = {
 
 	.submit = pgaio_uring_submit,
 	.wait_one = pgaio_uring_wait_one,
+	.check_one = pgaio_uring_check_one,
 };
 
 /*
@@ -658,6 +660,47 @@ pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
 				waited);
 }
 
+static void
+pgaio_uring_check_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+	ProcNumber	owner_procno = ioh->owner_procno;
+	PgAioUringContext *owner_context = &pgaio_uring_contexts[owner_procno];
+
+	/*
+	 * This check is not reliable when not holding the completion lock, but
+	 * it's a useful cheap pre-check to see if it's worth trying to get the
+	 * completion lock.
+	 */
+	if (!io_uring_cq_ready(&owner_context->io_uring_ring))
+		return;
+
+	/*
+	 * If the completion lock is currently held, the holder will likely
+	 * process any pending completions, give up.
+	 */
+	if (!LWLockConditionalAcquire(&owner_context->completion_lock, LW_EXCLUSIVE))
+		return;
+
+	pgaio_debug_io(DEBUG3, ioh,
+				   "check_one io_gen: %" PRIu64 ", ref_gen: %" PRIu64,
+				   ioh->generation,
+				   ref_generation);
+
+	/*
+	 * Recheck if there are any completions, another backend could have
+	 * processed them since we checked above, or our unlocked pre-check could
+	 * have been reading outdated values.
+	 *
+	 * It is possible that the IO handle has been reused since the start of
+	 * the call, but now that we have the lock, we can just as well drain all
+	 * completions.
+	 */
+	if (io_uring_cq_ready(&owner_context->io_uring_ring))
+		pgaio_uring_drain_locked(owner_context);
+
+	LWLockRelease(&owner_context->completion_lock);
+}
+
 static void
 pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 {
-- 
2.53.0.1.gb2826b52eb

>From eb8054440ed12cfc65e39054f2a04acf8cc4c10e Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 16:50:50 -0500
Subject: [PATCH v3 02/10] bufmgr: Return whether WaitReadBuffers() needed to
 wait

Thanks to the previous commit, pgaio_wref_check_done() will now detect whether
IO has completed even if userspace has not yet consumed the kernel
completion. To allow read_stream.c to use that knowledge to control readahead
aggressiveness, WaitReadBuffers() needs to return whether it actually needed
to wait.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/bufmgr.h        |  2 +-
 src/backend/storage/buffer/bufmgr.c | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd41b92f944..aa61a39d9e6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -251,7 +251,7 @@ extern bool StartReadBuffers(ReadBuffersOperation *operation,
 							 BlockNumber blockNum,
 							 int *nblocks,
 							 int flags);
-extern void WaitReadBuffers(ReadBuffersOperation *operation);
+extern bool WaitReadBuffers(ReadBuffersOperation *operation);
 
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cd21ae3fc36..af0e0524776 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1740,12 +1740,19 @@ ProcessReadBuffersResult(ReadBuffersOperation *operation)
 	Assert(operation->nblocks_done <= operation->nblocks);
 }
 
-void
+/*
+ * Wait for the IO operation initiated by StartReadBuffers() et al to
+ * complete.
+ *
+ * Returns true if we needed to wait for the IO operation, false otherwise.
+ */
+bool
 WaitReadBuffers(ReadBuffersOperation *operation)
 {
 	PgAioReturn *aio_ret = &operation->io_return;
 	IOContext	io_context;
 	IOObject	io_object;
+	bool		needed_wait = false;
 
 	if (operation->persistence == RELPERSISTENCE_TEMP)
 	{
@@ -1810,6 +1817,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
 
 				pgaio_wref_wait(&operation->io_wref);
+				needed_wait = true;
 
 				/*
 				 * The IO operation itself was already counted earlier, in
@@ -1874,6 +1882,12 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * If the IO completed only partially, we need to perform additional
+		 * work, consider that a form of having had to wait.
+		 */
+		needed_wait = true;
+
 		/*
 		 * This may only complete the IO partially, either because some
 		 * buffers were already valid, or because of a partial read.
@@ -1890,6 +1904,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	CheckReadBuffersOperation(operation, true);
 
 	/* NB: READ_DONE tracepoint was already executed in completion callback */
+	return needed_wait;
 }
 
 /*
-- 
2.53.0.1.gb2826b52eb

>From 0857f07ffd88f6db585aaf83c4ca012c8128f333 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 16:25:41 -0500
Subject: [PATCH v3 03/10] read_stream: Issue IO synchronously while in fast
 path

While in fast-path, execute any IO that we might encounter synchronously.
Because we are, in that moment, not reading ahead, dispatching any occasional
IO to workers has the dispatch overhead, without any realistic chance of the
IO completing before we need it.

This helps io_method=worker performance for workloads that have only
occasional cache misses, but where those occasional misses still take long
enough to matter.  It is likely this is only measurable with fast local
storage or workloads with the data in the kernel page cache, as with remote
storage the IO latency, not the dispatch-to-worker latency, is the determining
factor.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index cd54c1a74ac..7893fdf03d3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -833,6 +833,21 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			if (stream->advice_enabled)
 				flags |= READ_BUFFERS_ISSUE_ADVICE;
 
+			/*
+			 * While in fast-path, execute any IO that we might encounter
+			 * synchronously. Because we are, right now, only looking one
+			 * block ahead, dispatching any occasional IO to workers would
+			 * have the overhead of dispatching to workers, without any
+			 * realistic chance of the IO completing before we need it. We
+			 * will switch to non-synchronous IO after this.
+			 *
+			 * Arguably we should do so only for worker, as there's far less
+			 * dispatch overhead with io_uring. However, tests so far have not
+			 * shown a clear downside and additional io_method awareness here
+			 * seems not great from an abstraction POV.
+			 */
+			flags |= READ_BUFFERS_SYNCHRONOUSLY;
+
 			/*
 			 * Pin a buffer for the next call.  Same buffer entry, and
 			 * arbitrary I/O entry (they're all free).  We don't have to
@@ -860,6 +875,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->ios_in_progress = 1;
 			stream->ios[0].buffer_index = oldest_buffer_index;
 			stream->seq_blocknum = next_blocknum + 1;
+
+			/*
+			 * XXX: It might be worth triggering additional readahead here, to
+			 * avoid having to effectively do another synchronous IO for the
+			 * next block (if it were also a miss).
+			 */
 		}
 		else
 		{
-- 
2.53.0.1.gb2826b52eb

>From 19218f6b86a0e3e2d049d406f26bc23fb5ebb0e5 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sat, 28 Mar 2026 10:51:44 -0400
Subject: [PATCH v3 04/10] read_stream: Prevent distance from decaying too
 quickly

Until now we reduced the look-ahead distance by 1 on every hit, and doubled it
on every miss. That is problematic because there are very common IO patterns
where this prevents us from ever reaching a sufficiently high distance (e.g. a
miss followed by a hit will never have the distance grow beyond 2). In many
such cases, if we had ever reached a sufficient look-ahead distance, things
would have been fine, because we grow the distance faster than we decrease it.

One might think that the most obvious answer to this problem would be to never
reduce the distance. However, that would not work well, as (particularly with
upcoming users of read streams), it is reasonably common to at first have a
lot of misses and then to transition to a fully cached workload, e.g. because
the same blocks are needed repeatedly within one stream. Doing unnecessarily
deep readahead can be costly, due to having to pin a lot more buffers, which
increases CPU overhead.

Because the cost of a synchronously handled miss can be very high (multiple
milliseconds for every IO with commonly used storage) compared to the CPU
overhead of keeping the distance too high, we want to err on the side of not
reducing the distance too early.

The insight that a decrease of the distance by 1 at ever hit may be ok at
large distances, but not at low distances, shows a way out: If we only allow
decreasing the distance once there were no misses for our maximum look-ahead
distance, we will keep the distance high as long as readahead has a chance to
do IO asynchronously, but not commonly when not.

Several folks have written variants of this patch, at least Thomas Munro,
Melanie Plageman and I all have written variants of this.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 36 ++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 7893fdf03d3..fa27ec792c7 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -99,6 +99,7 @@ struct ReadStream
 	int16		forwarded_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	uint16		distance_decay_holdoff;
 	int16		initialized_buffers;
 	int16		resume_distance;
 	int			read_buffers_flags;
@@ -364,9 +365,22 @@ read_stream_start_pending_read(ReadStream *stream)
 	/* Remember whether we need to wait before returning this buffer. */
 	if (!need_wait)
 	{
-		/* Look-ahead distance decays, no I/O necessary. */
-		if (stream->distance > 1)
-			stream->distance--;
+		/*
+		 * If there currently is no IO in progress, and we have not needed to
+		 * issue IO recently, decay the look-ahead distance.  We detect if we
+		 * had to issue IO recently by having a decay holdoff that's set to
+		 * the max look-ahead distance whenever we need to do IO.  This is
+		 * important to ensure we eventually reach a high enough distance to
+		 * perform IO asynchronously when starting out with a small look-ahead
+		 * distance.
+		 */
+		if (stream->distance > 1 && stream->ios_in_progress == 0)
+		{
+			if (stream->distance_decay_holdoff == 0)
+				stream->distance--;
+			else
+				stream->distance_decay_holdoff--;
+		}
 	}
 	else
 	{
@@ -702,6 +716,7 @@ read_stream_begin_impl(int flags,
 	stream->seq_blocknum = InvalidBlockNumber;
 	stream->seq_until_processed = InvalidBlockNumber;
 	stream->temporary = SmgrIsTemp(smgr);
+	stream->distance_decay_holdoff = 0;

 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -954,6 +969,20 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		distance = Min(distance, stream->max_pinned_buffers);
 		stream->distance = distance;

+		/*
+		 * As we needed IO, prevent distance from being reduced within our
+		 * maximum look-ahead window. This avoids having distance collapse too
+		 * quickly in workloads where most of the required blocks are cached,
+		 * but where the remaining IOs are a sufficient enough factor to cause
+		 * a substantial slowdown if executed synchronously.
+		 *
+		 * There are valid arguments for preventing decay for max_ios or for
+		 * max_pinned_buffers.  But the argument for max_pinned_buffers seems
+		 * clearer - if we can't see any misses within the maximum look-ahead
+		 * distance, we can't do any useful readahead.
+		 */
+		stream->distance_decay_holdoff = stream->max_pinned_buffers;
+
 		/*
 		 * If we've reached the first block of a sequential region we're
 		 * issuing advice for, cancel that until the next jump.  The kernel
@@ -1128,6 +1157,7 @@ read_stream_reset(ReadStream *stream)
 	/* Start off assuming data is cached. */
 	stream->distance = 1;
 	stream->resume_distance = stream->distance;
+	stream->distance_decay_holdoff = 0;
 }

 /*
-- 
2.53.0.1.gb2826b52eb

>From 42073359bd63ded4aaae5c8d0403ed1db31c2e44 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 20:23:55 -0500
Subject: [PATCH v3 05/10] aio: io_uring: Trigger async processing for large
 IOs

io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
once the IO depth. That heuristic is important when doing buffered IO from the
kernel page cache, to allow parallelizing of the memory copy, as otherwise
io_method=io_uring would be a lot slower than io_method=worker in that case.

An upcoming commit will make read_stream.c only increase the readahead
distance if we needed to wait for IO to complete. If to-be-read data is in the
kernel page cache, io_uring will synchronously execute IO, unless the IO is
flagged as async.  Therefore the aforementioned change in read_stream.c
heuristic would lead to a substantial performance regression with io_uring
when data is in the page cache, as we would never reach a deep enough queue to
actually trigger the existing heuristic.

Parallelizing the copy from the page cache is mainly important when doing a
lot of IO, which commonly is only possible when doing largely sequential IO.

The reason we don't just mark all io_uring IOs as asynchronous is that the
dispatch to a kernel thread has overhead. This overhead is mostly noticeable
with small random IOs with a low queue depth, as in that case the gain from
parallelizing the memory copy is small and the latency cost high.

The facts from the two prior paragraphs show a way out: Use the size of the IO
in addition to the depth of the queue to trigger asynchronous processing.

One might think that just using the IO size might be enough, but
experimentation has shown that not to be the case - with deep look-ahead
distances being able to parallelize the memory copy is important even with
smaller IOs.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/method_io_uring.c | 90 +++++++++++++++++------
 1 file changed, 68 insertions(+), 22 deletions(-)

diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index 39984df31b4..0c8fe4598c3 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -409,7 +409,6 @@ static int
 pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	struct io_uring *uring_instance = &pgaio_my_uring_context->io_uring_ring;
-	int			in_flight_before = dclist_count(&pgaio_my_backend->in_flight_ios);
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -425,27 +424,6 @@ pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 
 		pgaio_io_prepare_submit(ioh);
 		pgaio_uring_sq_from_io(ioh, sqe);
-
-		/*
-		 * io_uring executes IO in process context if possible. That's
-		 * generally good, as it reduces context switching. When performing a
-		 * lot of buffered IO that means that copying between page cache and
-		 * userspace memory happens in the foreground, as it can't be
-		 * offloaded to DMA hardware as is possible when using direct IO. When
-		 * executing a lot of buffered IO this causes io_uring to be slower
-		 * than worker mode, as worker mode parallelizes the copying. io_uring
-		 * can be told to offload work to worker threads instead.
-		 *
-		 * If an IO is buffered IO and we already have IOs in flight or
-		 * multiple IOs are being submitted, we thus tell io_uring to execute
-		 * the IO in the background. We don't do so for the first few IOs
-		 * being submitted as executing in this process' context has lower
-		 * latency.
-		 */
-		if (in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED))
-			io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
-
-		in_flight_before++;
 	}
 
 	while (true)
@@ -701,10 +679,64 @@ pgaio_uring_check_one(PgAioHandle *ioh, uint64 ref_generation)
 	LWLockRelease(&owner_context->completion_lock);
 }
 
+/*
+ * io_uring executes IO in process context if possible. That's generally good,
+ * as it reduces context switching. When performing a lot of buffered IO that
+ * means that copying between page cache and userspace memory happens in the
+ * foreground, as it can't be offloaded to DMA hardware as is possible when
+ * using direct IO. When executing a lot of buffered IO this causes io_uring
+ * to be slower than worker mode, as worker mode parallelizes the
+ * copying. io_uring can be told to offload work to worker threads instead.
+ *
+ * If the IOs are small, we only benefit from forcing things into the
+ * background if there is a lot of IO, as otherwise the overhead from context
+ * switching is higher than the gain.
+ *
+ * If IOs are large, there is benefit from asynchronous processing at lower
+ * queue depths, as IO latency is less of a crucial factor and parallelizing
+ * memory copies is more important.  In addition, it is important to trigger
+ * asynchronous processing even at low queue depth, as with foreground
+ * processing we might never actually reach deep enough IO depths to trigger
+ * asynchronous processing, which in turn would deprive readahead control
+ * logic of information about whether a deeper look-ahead distance would be
+ * advantageous.
+ *
+ * We have done some basic benchmarking to validate the thresholds used, but
+ * it's quite plausible that there are better values.
+ */
+static bool
+pgaio_uring_should_use_async(PgAioHandle *ioh, size_t io_size)
+{
+	/*
+	 * With DIO there's no benefit from forcing asynchronous processing, as
+	 * io_uring will never process IO completions in the foreground. The
+	 * kernel will use worker threads where appropriate.
+	 */
+	if (!(ioh->flags & PGAIO_HF_BUFFERED))
+		return false;
+
+	/*
+	 * Once the IO queue depth is not that shallow anymore, the overhead of
+	 * dispatching to the background is a less significant factor.
+	 */
+	if (dclist_count(&pgaio_my_backend->in_flight_ios) > 4)
+		return true;
+
+	/*
+	 * If the IO is larger, the gains from parallelizing the memory copy are
+	 * larger and typically the impact of the latency is smaller.
+	 */
+	if (io_size >= (BLCKSZ * 4))
+		return true;
+
+	return false;
+}
+
 static void
 pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 {
 	struct iovec *iov;
+	size_t		io_size = 0;
 
 	switch ((PgAioOp) ioh->op)
 	{
@@ -717,6 +749,8 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 								   iov->iov_base,
 								   iov->iov_len,
 								   ioh->op_data.read.offset);
+
+				io_size = iov->iov_len;
 			}
 			else
 			{
@@ -726,7 +760,13 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 									ioh->op_data.read.iov_length,
 									ioh->op_data.read.offset);
 
+				for (int i = 0; i < ioh->op_data.read.iov_length; i++, iov++)
+					io_size += iov->iov_len;
 			}
+
+			if (pgaio_uring_should_use_async(ioh, io_size))
+				io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
+
 			break;
 
 		case PGAIO_OP_WRITEV:
@@ -747,6 +787,12 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 									 ioh->op_data.write.iov_length,
 									 ioh->op_data.write.offset);
 			}
+
+			/*
+			 * For now don't trigger use of IOSQE_ASYNC for writes, it's not
+			 * clear there is a performance benefit in doing so.
+			 */
+
 			break;
 
 		case PGAIO_OP_INVALID:
-- 
2.53.0.1.gb2826b52eb

>From b155805cf1a465aef6933980d7e350a78c3fa4c2 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 18:00:53 -0500
Subject: [PATCH v3 06/10] read_stream: Only increase distance when waiting for
 IO

This avoids increasing the distance to the maximum in cases where the IO
subsystem is already keeping up. This turns out to be important for
performance for two reasons:

- Pinning a lot of buffers is not cheap. If additional pins allow us to avoid
  IO waits, it's definitely worth it, but if we can already do all the
  necessary readahead at a distance of 16, reading ahead 512 buffers can
  increase the CPU overhead substantially.  This is particularly noticeable
  when the to-be-read blocks are already in the kernel page cache.

- If the read stream is read to completion, reading in data earlier than
  needed is of limited consequences, leaving aside the CPU costs mentioned
  above. But if the read stream will not be fully consumed, e.g. because it is
  on the inner side of a nested loop join, the additional IO can be a serious
  performance issue. This is not that commonly a problem for current read
  stream users, but the upcoming work, to use a read stream to fetch table
  pages as part of an index scan, frequently encounters this.

Note that this commit would have substantial performance downsides without
earlier commits. In particular the earlier commit to avoid decreasing the
readahead distance when there was recent IO is crucial, as otherwise we very
often would end up not reading ahead aggressively enough anymore with this
commit, due to increasing the distance less often.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 39 +++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index fa27ec792c7..49971833ddc 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -952,22 +952,51 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	{
 		int16		io_index = stream->oldest_io_index;
 		int32		distance;	/* wider temporary value, clamped below */
+		bool		needed_wait;
 
 		/* Sanity check that we still agree on the buffers. */
 		Assert(stream->ios[io_index].op.buffers ==
 			   &stream->buffers[oldest_buffer_index]);
 
-		WaitReadBuffers(&stream->ios[io_index].op);
+		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
 
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
 		if (++stream->oldest_io_index == stream->max_ios)
 			stream->oldest_io_index = 0;
 
-		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
-		distance = Min(distance, stream->max_pinned_buffers);
-		stream->distance = distance;
+		/*
+		 * If the IO was executed synchronously, we will never see
+		 * WaitReadBuffers() block. This is particularly crucial when
+		 * effective_io_concurrency=0 is used, as all IO will be
+		 * synchronous. Without treating synchronous IO as having waited, we'd
+		 * never allow the distance to get large enough to allow for IO
+		 * combining, resulting in bad performance.
+		 */
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_SYNCHRONOUSLY)
+			needed_wait = true;
+
+		/*
+		 * Have the look-ahead distance ramp up rapidly after we needed to
+		 * wait for IO. We only increase the distance when we needed to wait,
+		 * to avoid increasing the distance further than necessary, as looking
+		 * ahead too far can be costly, both due to the cost of unnecessarily
+		 * pinning many buffers and due to doing IOs that may never be
+		 * consumed if the stream is ended/reset before completion.
+		 *
+		 * If we did not need to wait, the current distance was evidently
+		 * sufficient.
+		 *
+		 * NB: May not increase the distance if we already reached the end of
+		 * the stream, as stream->distance == 0 is used to keep track of
+		 * having reached the end.
+		 */
+		if (stream->distance > 0 && needed_wait)
+		{
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
 
 		/*
 		 * As we needed IO, prevent distance from being reduced within our
-- 
2.53.0.1.gb2826b52eb

>From 0d3f4a13858d35c06e807692bc10a120fbb176c4 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Thu, 19 Mar 2026 23:06:58 -0400
Subject: [PATCH v3 07/10] Hacky implementation of making
 read_stream_reset()/end() not wait for IO

Not waiting for IO during read_stream_reset() can be important for performance
in cases where read streams are frequently reset before the end is
reached. Current users do not commonly do that, but the upcoming work to use a
read stream to prefetch table blocks as part of index scans can do so
frequently in some query patterns. E.g. if there is an index scan on the inner
side of a nested loop.

FIXME: This implementation is problematic though, as there is nothing forcing
the discarded IOs to ever be completed if the backend goes idle. That's at the
very least problematic because it leads to the underlying buffers continuing
to be pinned and the IOs showing up in the pg_aios view.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/aio.h             |  1 +
 src/backend/storage/aio/aio.c         | 27 ++++++++++++++++++++++
 src/backend/storage/aio/read_stream.c | 32 ++++++++++++++++++++++-----
 3 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index ec543b78409..c184e97a977 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -328,6 +328,7 @@ extern int	pgaio_wref_get_id(PgAioWaitRef *iow);
 extern void pgaio_wref_wait(PgAioWaitRef *iow);
 extern bool pgaio_wref_check_done(PgAioWaitRef *iow);
 
+extern void pgaio_wref_discard_result(PgAioWaitRef *iow);
 
 
 /* --------------------------------------------------------------------------------
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 4e742038d02..49eb677ad06 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -1055,6 +1055,33 @@ pgaio_wref_check_done(PgAioWaitRef *iow)
 	return false;
 }
 
+void
+pgaio_wref_discard_result(PgAioWaitRef *iow)
+{
+	uint64		ref_generation;
+	bool		am_owner;
+	PgAioHandle *ioh;
+	PgAioHandleState state;
+
+	ioh = pgaio_io_from_wref(iow, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+	if (!am_owner)
+		elog(ERROR, "not you");
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	pgaio_debug_io(DEBUG2, ioh,
+				   "discarding result %p",
+				   ioh->report_return);
+
+	if (ioh->resowner)
+		pgaio_io_release_resowner(&ioh->resowner_node, false);
+}
+
+
 
 
 /* --------------------------------------------------------------------------------
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 49971833ddc..144b3613c92 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -524,7 +524,7 @@ read_stream_look_ahead(ReadStream *stream)
 		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks >= stream->distance &&
 		  stream->pinned_buffers == 0) ||
-		 stream->distance == 0) &&
+		 stream->distance <= 0) &&
 		stream->ios_in_progress < stream->max_ios)
 		read_stream_start_pending_read(stream);
 
@@ -534,7 +534,7 @@ read_stream_look_ahead(ReadStream *stream)
 	 * stream.  In the worst case we can always make progress one buffer at a
 	 * time.
 	 */
-	Assert(stream->pinned_buffers > 0 || stream->distance == 0);
+	Assert(stream->pinned_buffers > 0 || stream->distance <= 0);
 
 	if (stream->batch_mode)
 		pgaio_exit_batchmode();
@@ -916,7 +916,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
 
 		/* End of stream reached?  */
-		if (stream->distance == 0)
+		if (stream->distance <= 0)
 			return InvalidBuffer;
 
 		/*
@@ -930,7 +930,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		/* End of stream reached? */
 		if (stream->pinned_buffers == 0)
 		{
-			Assert(stream->distance == 0);
+			Assert(stream->distance <= 0);
 			return InvalidBuffer;
 		}
 	}
@@ -958,7 +958,27 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->ios[io_index].op.buffers ==
 			   &stream->buffers[oldest_buffer_index]);
 
-		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
+		/*
+		 * If the stream has been reset, don't even wait for the IO, just
+		 * discard it.
+		 */
+		if (stream->distance < 0)
+		{
+			if (pgaio_wref_valid(&stream->ios[io_index].op.io_wref) &&
+				!stream->ios[io_index].op.foreign_io)
+			{
+				pgaio_wref_discard_result(&stream->ios[io_index].op.io_wref);
+				pgaio_wref_clear(&stream->ios[io_index].op.io_wref);
+			}
+			else
+				WaitReadBuffers(&stream->ios[io_index].op);
+
+			needed_wait = false;
+		}
+		else
+		{
+			needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
+		}
 
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
@@ -1152,7 +1172,7 @@ read_stream_reset(ReadStream *stream)
 	Buffer		buffer;
 
 	/* Stop looking ahead. */
-	stream->distance = 0;
+	stream->distance = -1;
 
 	/* Forget buffered block number and fast path state. */
 	stream->buffered_blocknum = InvalidBlockNumber;
-- 
2.53.0.1.gb2826b52eb

>From ab0e0c198b5299cf015a7c9ef357d8651a776aef Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Mon, 30 Mar 2026 12:25:07 -0400
Subject: [PATCH v3 08/10] WIP: read stream: Split decision about look ahead
 for AIO and combining

Previous commits caused a regression due to the this conflation. This is a
first attempt at fixing the problem.  Needs significant reordering and
splitting if it works out.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 242 +++++++++++++++++++++-----
 1 file changed, 195 insertions(+), 47 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 144b3613c92..1c375edad1d 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -98,10 +98,14 @@ struct ReadStream
 	int16		max_pinned_buffers;
 	int16		forwarded_buffers;
 	int16		pinned_buffers;
-	int16		distance;
+	/* limit of how far to read ahead for IO combining */
+	int16		combine_distance;
+	/* limit of how far to read ahead for starting IO early */
+	int16		readahead_distance;
 	uint16		distance_decay_holdoff;
 	int16		initialized_buffers;
-	int16		resume_distance;
+	int16		resume_readahead_distance;
+	int16		resume_combine_distance;
 	int			read_buffers_flags;
 	bool		sync_mode;		/* using io_method=sync */
 	bool		batch_mode;		/* READ_STREAM_USE_BATCHING */
@@ -332,8 +336,8 @@ read_stream_start_pending_read(ReadStream *stream)
 
 		/* Shrink distance: no more look-ahead until buffers are released. */
 		new_distance = stream->pinned_buffers + buffer_limit;
-		if (stream->distance > new_distance)
-			stream->distance = new_distance;
+		if (stream->readahead_distance > new_distance)
+			stream->readahead_distance = new_distance;
 
 		/* Unless we have nothing to give the consumer, stop here. */
 		if (stream->pinned_buffers > 0)
@@ -374,12 +378,23 @@ read_stream_start_pending_read(ReadStream *stream)
 		 * perform IO asynchronously when starting out with a small look-ahead
 		 * distance.
 		 */
-		if (stream->distance > 1 && stream->ios_in_progress == 0)
+		if (stream->ios_in_progress == 0)
 		{
-			if (stream->distance_decay_holdoff == 0)
-				stream->distance--;
-			else
+			if (stream->distance_decay_holdoff > 0)
 				stream->distance_decay_holdoff--;
+			else
+			{
+				if (stream->readahead_distance > 1)
+					stream->readahead_distance--;
+
+				/*
+				 * XXX: Should we actually reduce this at any time other than
+				 * a reset? For now we have to, as this is also a condition
+				 * for re-enabling fast_path.
+				 */
+				if (stream->combine_distance > 1)
+					stream->combine_distance--;
+			}
 		}
 	}
 	else
@@ -440,6 +455,101 @@ read_stream_start_pending_read(ReadStream *stream)
 	return true;
 }
 
+/*
+ * Should we continue to perform look ahead?  The look ahead may allow us to
+ * make the pending IO larger via IO combining or to issue more read ahead.
+ */
+static bool
+read_stream_should_look_ahead(ReadStream *stream)
+{
+	/* never start more IOs than our cap */
+	if (stream->ios_in_progress >= stream->max_ios)
+		return false;
+
+	/* If the callback has signaled end-of-stream, we're done */
+	if (stream->readahead_distance <= 0)
+		return false;
+
+	/* never pin more buffers than allowed */
+	if (stream->pinned_buffers + stream->pending_read_nblocks >= stream->max_pinned_buffers)
+		return false;
+
+	/*
+	 * Allow looking further ahead if we have an the process of building a
+	 * larger IO, the IO is not yet big enough and we don't yet have IO in
+	 * flight.  Note that this is allowed even if we are reaching the
+	 * readahead limit (but not the buffer pin limit).
+	 *
+	 * This is important for cases where either effective_io_concurrency is
+	 * low or we never need to wait for IO and thus are not increasing the
+	 * distance. Without this we would end up with lots of small IOs.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		stream->pinned_buffers == 0 &&
+		stream->pending_read_nblocks < stream->combine_distance)
+		return true;
+
+	/*
+	 * Don't start more readahead if that'd put us over the limit for doing
+	 * readahead.
+	 */
+	if (stream->pinned_buffers + stream->pending_read_nblocks >= stream->readahead_distance)
+		return false;
+
+	return true;
+}
+
+
+/*
+ * We don't start the pending read just because we've hit the distance limit,
+ * preferring to give it another chance to grow to full io_combine_limit size
+ * once more buffers have been consumed.  But this is not desirable in all
+ * situations - see below.
+ */
+static bool
+read_stream_should_issue_now(ReadStream *stream)
+{
+	int16		pending_read_nblocks = stream->pending_read_nblocks;
+
+	/* no IO to issue */
+	if (pending_read_nblocks == 0)
+		return false;
+
+	/* never start more IOs than our cap */
+	if (stream->ios_in_progress >= stream->max_ios)
+		return false;
+
+	/*
+	 * If the callback has signaled end-of-stream, start the read
+	 * immediately. There's no deferring it for later.
+	 */
+	if (stream->readahead_distance <= 0)
+		return true;
+
+	/*
+	 * If we've already reached io_combine_limit, there's no chance of growing
+	 * the read further.
+	 */
+	if (pending_read_nblocks >= stream->io_combine_limit)
+		return true;
+
+	/* same if capped not by io_combine_limit but combine_distance */
+	if (stream->combine_distance > 0 &&
+		pending_read_nblocks >= stream->combine_distance)
+		return true;
+
+	/*
+	 * If we currently have no reads in flight or prepared, issue the IO once
+	 * we stopped looking ahead. This ensures there's always at least one IO
+	 * prepared.
+	 */
+	if (stream->pinned_buffers == 0 &&
+		!read_stream_should_look_ahead(stream))
+		return true;
+
+	return false;
+}
+
 static void
 read_stream_look_ahead(ReadStream *stream)
 {
@@ -452,14 +562,13 @@ read_stream_look_ahead(ReadStream *stream)
 	if (stream->batch_mode)
 		pgaio_enter_batchmode();
 
-	while (stream->ios_in_progress < stream->max_ios &&
-		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	while (read_stream_should_look_ahead(stream))
 	{
 		BlockNumber blocknum;
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == stream->io_combine_limit)
+		if (read_stream_should_issue_now(stream))
 		{
 			read_stream_start_pending_read(stream);
 			continue;
@@ -479,7 +588,8 @@ read_stream_look_ahead(ReadStream *stream)
 		if (blocknum == InvalidBlockNumber)
 		{
 			/* End of stream. */
-			stream->distance = 0;
+			stream->readahead_distance = 0;
+			stream->combine_distance = 0;
 			break;
 		}
 
@@ -511,21 +621,13 @@ read_stream_look_ahead(ReadStream *stream)
 	}
 
 	/*
-	 * We don't start the pending read just because we've hit the distance
-	 * limit, preferring to give it another chance to grow to full
-	 * io_combine_limit size once more buffers have been consumed.  However,
-	 * if we've already reached io_combine_limit, or we've reached the
-	 * distance limit and there isn't anything pinned yet, or the callback has
-	 * signaled end-of-stream, we start the read immediately.  Note that the
-	 * pending read can exceed the distance goal, if the latter was reduced
-	 * after hitting the per-backend buffer limit.
+	 * Check if the pending read should be issued now, or if we should give it
+	 * another chance to grow to the full size.
+	 *
+	 * Note that the pending read can exceed the distance goal, if the latter
+	 * was reduced after hitting the per-backend buffer limit.
 	 */
-	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == stream->io_combine_limit ||
-		 (stream->pending_read_nblocks >= stream->distance &&
-		  stream->pinned_buffers == 0) ||
-		 stream->distance <= 0) &&
-		stream->ios_in_progress < stream->max_ios)
+	if (read_stream_should_issue_now(stream))
 		read_stream_start_pending_read(stream);
 
 	/*
@@ -534,7 +636,7 @@ read_stream_look_ahead(ReadStream *stream)
 	 * stream.  In the worst case we can always make progress one buffer at a
 	 * time.
 	 */
-	Assert(stream->pinned_buffers > 0 || stream->distance <= 0);
+	Assert(stream->pinned_buffers > 0 || stream->readahead_distance <= 0);
 
 	if (stream->batch_mode)
 		pgaio_exit_batchmode();
@@ -724,10 +826,17 @@ read_stream_begin_impl(int flags,
 	 * doing full io_combine_limit sized reads.
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
+	{
+		stream->readahead_distance = Min(max_pinned_buffers, stream->io_combine_limit);
+		stream->combine_distance = stream->io_combine_limit;
+	}
 	else
-		stream->distance = 1;
-	stream->resume_distance = stream->distance;
+	{
+		stream->readahead_distance = 1;
+		stream->combine_distance = 1;
+	}
+	stream->resume_readahead_distance = stream->readahead_distance;
+	stream->resume_combine_distance = stream->combine_distance;
 
 	/*
 	 * Since we always access the same relation, we can initialize parts of
@@ -826,7 +935,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->ios_in_progress == 0);
 		Assert(stream->forwarded_buffers == 0);
 		Assert(stream->pinned_buffers == 1);
-		Assert(stream->distance == 1);
+		Assert(stream->readahead_distance == 1);
 		Assert(stream->pending_read_nblocks == 0);
 		Assert(stream->per_buffer_data_size == 0);
 		Assert(stream->initialized_buffers > stream->oldest_buffer_index);
@@ -900,7 +1009,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No more blocks, end of stream. */
-			stream->distance = 0;
+			stream->readahead_distance = 0;
 			stream->oldest_buffer_index = stream->next_buffer_index;
 			stream->pinned_buffers = 0;
 			stream->buffers[oldest_buffer_index] = InvalidBuffer;
@@ -916,7 +1025,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
 
 		/* End of stream reached?  */
-		if (stream->distance <= 0)
+		if (stream->readahead_distance <= 0)
 			return InvalidBuffer;
 
 		/*
@@ -930,7 +1039,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		/* End of stream reached? */
 		if (stream->pinned_buffers == 0)
 		{
-			Assert(stream->distance <= 0);
+			Assert(stream->readahead_distance <= 0);
 			return InvalidBuffer;
 		}
 	}
@@ -951,7 +1060,6 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
 	{
 		int16		io_index = stream->oldest_io_index;
-		int32		distance;	/* wider temporary value, clamped below */
 		bool		needed_wait;
 
 		/* Sanity check that we still agree on the buffers. */
@@ -962,7 +1070,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		 * If the stream has been reset, don't even wait for the IO, just
 		 * discard it.
 		 */
-		if (stream->distance < 0)
+		if (stream->readahead_distance < 0)
 		{
 			if (pgaio_wref_valid(&stream->ios[io_index].op.io_wref) &&
 				!stream->ios[io_index].op.foreign_io)
@@ -1011,11 +1119,38 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		 * the stream, as stream->distance == 0 is used to keep track of
 		 * having reached the end.
 		 */
-		if (stream->distance > 0 && needed_wait)
+		if (stream->readahead_distance > 0 && needed_wait)
 		{
-			distance = stream->distance * 2;
-			distance = Min(distance, stream->max_pinned_buffers);
-			stream->distance = distance;
+			/* wider temporary value, due to oveflow risk */
+			int32		readahead_distance;
+
+			readahead_distance = stream->readahead_distance * 2;
+			readahead_distance = Min(readahead_distance, stream->max_pinned_buffers);
+			stream->readahead_distance = readahead_distance;
+		}
+
+		/*
+		 * Whether we needed to wait or not, allow for more IO combining if we
+		 * needed to do IO. The reason to do so independent of needing to wait
+		 * is that when the data is resident in the kernel page cache, IO
+		 * combining reduces the syscall / dispatch overhead, making it
+		 * worthwhile regardless of needing to wait.
+		 *
+		 * It is also important with io_uring as it will never signal the need
+		 * to wait for reads if all the data is in the page cache. There are
+		 * heuristics to deal with that in method_io_uring.c, but they only
+		 * work when the IO gets large enough.
+		 */
+		if (stream->combine_distance > 0 &&
+			stream->combine_distance < stream->io_combine_limit)
+		{
+			/* wider temporary value, due to oveflow risk */
+			int32		combine_distance;
+
+			combine_distance = stream->combine_distance * 2;
+			combine_distance = Min(combine_distance, stream->io_combine_limit);
+			combine_distance = Min(combine_distance, stream->max_pinned_buffers);
+			stream->combine_distance = combine_distance;
 		}
 
 		/*
@@ -1094,10 +1229,18 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 	/* See if we can take the fast path for all-cached scans next time. */
+	/*
+	 * FIXME: It's way too easy to wrongly fast path. I'm pretty sure there's
+	 * several pre-existing cases where it triggers because we are not issuing
+	 * additional prefetching (e.g. because of a small
+	 * effective_io_concurrency) and thus stream->pinned_buffers stays at 1
+	 * after read_stream_look_ahead().
+	 */
 	if (stream->ios_in_progress == 0 &&
 		stream->forwarded_buffers == 0 &&
 		stream->pinned_buffers == 1 &&
-		stream->distance == 1 &&
+		stream->readahead_distance == 1 &&
+		stream->combine_distance == 1 &&
 		stream->pending_read_nblocks == 0 &&
 		stream->per_buffer_data_size == 0)
 	{
@@ -1143,8 +1286,9 @@ read_stream_next_block(ReadStream *stream, BufferAccessStrategy *strategy)
 BlockNumber
 read_stream_pause(ReadStream *stream)
 {
-	stream->resume_distance = stream->distance;
-	stream->distance = 0;
+	stream->resume_readahead_distance = stream->readahead_distance;
+	stream->resume_combine_distance = stream->combine_distance;
+	stream->readahead_distance = 0;
 	return InvalidBlockNumber;
 }
 
@@ -1156,7 +1300,8 @@ read_stream_pause(ReadStream *stream)
 void
 read_stream_resume(ReadStream *stream)
 {
-	stream->distance = stream->resume_distance;
+	stream->readahead_distance = stream->resume_readahead_distance;
+	stream->combine_distance = stream->resume_combine_distance;
 }
 
 /*
@@ -1172,7 +1317,8 @@ read_stream_reset(ReadStream *stream)
 	Buffer		buffer;
 
 	/* Stop looking ahead. */
-	stream->distance = -1;
+	stream->readahead_distance = -1;
+	stream->combine_distance = -1;
 
 	/* Forget buffered block number and fast path state. */
 	stream->buffered_blocknum = InvalidBlockNumber;
@@ -1204,8 +1350,10 @@ read_stream_reset(ReadStream *stream)
 	Assert(stream->ios_in_progress == 0);
 
 	/* Start off assuming data is cached. */
-	stream->distance = 1;
-	stream->resume_distance = stream->distance;
+	stream->readahead_distance = 1;
+	stream->combine_distance = 1;
+	stream->resume_readahead_distance = stream->readahead_distance;
+	stream->resume_combine_distance = stream->combine_distance;
 	stream->distance_decay_holdoff = 0;
 }
 
-- 
2.53.0.1.gb2826b52eb

AIO / read stream heuristics adjustments for index prefetching

Reply via email to