Re: Don't synchronously wait for already-in-progress IO in read stream

Andres Freund Tue, 17 Mar 2026 10:26:26 -0700

Hi,

> Attached v5 adds some comments to the tests, fixes a few nits in the
> actual code, and adds a commit to fix what I think is an existing
> off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.



> Subject: [PATCH v5 3/5] Fix off-by-one error in read IO tracing
>
> ---
>  src/backend/storage/buffer/bufmgr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c 
> b/src/backend/storage/buffer/bufmgr.c
> index 00bc609529a..0723d4f3dd8 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int 
> *nblocks_progress)
>                * must have started out as a miss in PinBufferForBlock(). The 
> other
>                * backend will track this as a 'read'.
>                */
> -             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + 
> operation->nblocks_done,
> +             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + 
> operation->nblocks_done - 1,
>                                                                               
>   operation->smgr->smgr_rlocator.locator.spcOid,
>                                                                               
>   operation->smgr->smgr_rlocator.locator.dbOid,
>                                                                               
>   operation->smgr->smgr_rlocator.locator.relNumber,
> --

Ah, the issue is that we already incremented nblocks_done, right?  Maybe it'd
be easier to understand if we stashed blocknum + nblocks_done into a local
var, and use it in in both branches of if (!ReadBuffersCanStartIO())?

This probably needs to be backpatched...



> Subject: [PATCH v5 4/5] Make buffer hit helper
>
> Already two places count buffer hits, requiring quite a few lines of
> code since we do accounting in so many places. Future commits will add
> more locations, so refactor into a helper.
>
> Reviewed-by: Nazir Bilal Yavuz <[email protected]>
> Discussion: 
> https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
> ---
>  src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
>  1 file changed, 56 insertions(+), 55 deletions(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c 
> b/src/backend/storage/buffer/bufmgr.c
> index 0723d4f3dd8..399004c2e44 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
>                                                                         bool 
> *foundPtr, IOContext io_context);
>  static bool AsyncReadBuffers(ReadBuffersOperation *operation, int 
> *nblocks_progress);
>  static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool 
> is_complete);
> +
> +static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy 
> strategy,
> +                                                                             
>                           Relation rel, char persistence, SMgrRelation smgr,
> +                                                                             
>                           ForkNumber forknum, BlockNumber blocknum);
>  static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext 
> io_context);
>  static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
>                                                               IOObject 
> io_object, IOContext io_context);
> @@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel,
>                                 bool *foundPtr)
>  {
>       BufferDesc *bufHdr;
> -     IOContext       io_context;
> -     IOObject        io_object;
>
>       Assert(blockNum != P_NEW);
>
> @@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
>                       persistence == RELPERSISTENCE_PERMANENT ||
>                       persistence == RELPERSISTENCE_UNLOGGED));
>
> -     if (persistence == RELPERSISTENCE_TEMP)
> -     {
> -             io_context = IOCONTEXT_NORMAL;
> -             io_object = IOOBJECT_TEMP_RELATION;
> -     }
> -     else
> -     {
> -             io_context = IOContextForStrategy(strategy);
> -             io_object = IOOBJECT_RELATION;
> -     }
> -

I'm mildly worried that this will lead to a bit worse code generation, the
compiler might have a harder time figuring out that io_context/io_object
doesn't change across multiple PinBufferForBlock calls. Although it already
might be unable to do so (we don't mark IOContextForStrategy as
pure [1]).

I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite
direction, and explicitly look up IOContextForStrategy(strategy) *before* the
actual_nblocks loop to make sure the compiler doesn't inject external function
calls (which will in all likelihood require register spilling etc).

I don't think that necessarily has to conflict with the goal of this patch -
most of the the deduplicated stuff isn't io_context, so the helper will be
beneficial even if have to pull out the io_context/io_object determination to
the callsites.


>       TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
>                                                                          
> smgr->smgr_rlocator.locator.spcOid,
>                                                                          
> smgr->smgr_rlocator.locator.dbOid,
> @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
>                                                                          
> smgr->smgr_rlocator.backend);
>
>       if (persistence == RELPERSISTENCE_TEMP)
> -     {
>               bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
> -             if (*foundPtr)
> -                     pgBufferUsage.local_blks_hit++;
> -     }
>       else
> -     {
>               bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
> -                                                      strategy, foundPtr, 
> io_context);
> -             if (*foundPtr)
> -                     pgBufferUsage.shared_blks_hit++;
> -     }
> +                                                      strategy, foundPtr, 
> IOContextForStrategy(strategy));
> +
>       if (rel)
>       {
>               /*

And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
branch in CountBufferHit(), I suspect the compiler may not be able to optimize
it away.

At the very least I'd invert the call to CountBufferHit() and the
pgstat_count_buffer_read(), as the latter will probably prevent most
optimizations (due to the compiler not being able to prove that
(rel)->pgstat_info->counts.blocks_fetched is a different memory location as
*foundPtr).



> +/*
> + * We track various stats related to buffer hits. Because this is done in a
> + * few separate places, this helper exists for convenience.
> + */
> +static pg_attribute_always_inline void
> +CountBufferHit(BufferAccessStrategy strategy,
> +                        Relation rel, char persistence, SMgrRelation smgr,
> +                        ForkNumber forknum, BlockNumber blocknum)
> +{
> +     IOContext       io_context;
> +     IOObject        io_object;
> +
> +     if (persistence == RELPERSISTENCE_TEMP)
> +     {
> +             io_context = IOCONTEXT_NORMAL;
> +             io_object = IOOBJECT_TEMP_RELATION;
> +     }
> +     else
> +     {
> +             io_context = IOContextForStrategy(strategy);
> +             io_object = IOOBJECT_RELATION;
> +     }
> +
> +     TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
> +                                                                       
> blocknum,
> +                                                                       
> smgr->smgr_rlocator.locator.spcOid,
> +                                                                       
> smgr->smgr_rlocator.locator.dbOid,
> +                                                                       
> smgr->smgr_rlocator.locator.relNumber,
> +                                                                       
> smgr->smgr_rlocator.backend,
> +                                                                       true);
> +
> +     if (persistence == RELPERSISTENCE_TEMP)
> +             pgBufferUsage.local_blks_hit += 1;
> +     else
> +             pgBufferUsage.shared_blks_hit += 1;
> +
> +     if (rel)
> +             pgstat_count_buffer_hit(rel);
> +
> +     pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
> +
> +     if (VacuumCostActive)
> +             VacuumCostBalance += VacuumCostPageHit;
> +}

I don't think "Count*" is a great name for something that does tracepoints and
vacuum cost balance accounting, the latter actually changes behavior of the
program due to the sleeps it injects.

The first alternative I have is AccountForBufferHit(), not great, but still
seems a bit better.



> From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Fri, 23 Jan 2026 14:00:31 -0500
> Subject: [PATCH v5 5/5] Don't wait for already in-progress IO
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> When a backend attempts to start a read on a buffer and finds that I/O
> is already in progress, it previously waited for that I/O to complete
> before initiating reads for any other buffers. Although the backend must
> still wait for the I/O to finish when later acquiring the buffer, it
> should not need to wait at read start time. Other buffers may be
> available for I/O, and in some workloads this waiting significantly
> reduces concurrency.
>
> For example, index scans may repeatedly request the same heap block. If
> the backend waits each time it encounters an in-progress read, the
> access pattern effectively degenerates into synchronous I/O. By
> introducing the concept of foreign I/O operations, a backend can record
> the buffer’s wait reference and defer waiting until WaitReadBuffers()
> when it actually acquires the buffer.
>
> In rare cases, a backend may still need to wait when starting a read if
> it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
> but before the buffer descriptor’s wait reference has been set. Such
> windows should be brief and uncommon.
>
> Author: Melanie Plageman <[email protected]>
> Reviewed-by: Andres Freund <[email protected]>
> Reviewed-by: Nazir Bilal Yavuz <[email protected]>
> Discussion: 
> https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv

> +/*
> + * In AsyncReadBuffers(), when preparing a buffer for reading and setting
> + * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
> + * already contain the desired block. AsyncReadBuffers() must distinguish
> + * between these cases (and the case where it should initiate I/O) so it can
> + * mark an in-progress buffer as foreign I/O rather than waiting on it.
> + */
> +typedef enum PrepareReadBuffer_Status
> +{
> +     READ_BUFFER_ALREADY_DONE,
> +     READ_BUFFER_IN_PROGRESS,
> +     READ_BUFFER_READY_FOR_IO,
> +} PrepareReadBuffer_Status;

I don't personally like mixing underscore and camel case naming within one
name.

I wonder if might be worth splitting this up in a refactoring and a
"behavioural change" commit. Might be too complicated.

Candidates for a split seem to be:
- Moving pgaio_io_acquire_nb() to earlier
- Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support
for READ_BUFFER_IN_PROGRESS
- introduce READ_BUFFER_IN_PROGRESS


>  /*
>   * We track various stats related to buffer hits. Because this is done in a
>   * few separate places, this helper exists for convenience.
> @@ -1815,8 +1791,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>                        * b) reports some time as waiting, even if we never 
> waited
>                        *
>                        * we first check if we already know the IO is complete.
> +                      *
> +                      * Note that operation->io_return is uninitialized for 
> foreign IO,
> +                      * so we cannot count that wait time.
>                        */

I'm confused - your comment says we can't count wait time with a foreign IO,
but then oes on to count foreign IO time?  The lack of io_return just means we
can't do  the cheaper pre-check for PGAIO_RS_UNKNOWN, no?


> -                     if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
> +                     if ((operation->foreign_io || aio_ret->result.status == 
> PGAIO_RS_UNKNOWN) &&
>                               !pgaio_wref_check_done(&operation->io_wref))
>                       {
>                               instr_time      io_start = 
> pgstat_prepare_io_time(track_io_timing);
> @@ -1835,11 +1814,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>                               
> Assert(pgaio_wref_check_done(&operation->io_wref));
>                       }
>
> -                     /*
> -                      * We now are sure the IO completed. Check the results. 
> This
> -                      * includes reporting on errors if there were any.
> -                      */
> -                     ProcessReadBuffersResult(operation);
> +                     if (unlikely(operation->foreign_io))
> +                     {
> +                             Buffer          buffer = 
> operation->buffers[operation->nblocks_done];
> +                             BufferDesc *desc = BufferIsLocal(buffer) ?
> +                                     GetLocalBufferDescriptor(-buffer - 1) :
> +                                     GetBufferDescriptor(buffer - 1);
> +                             uint32          buf_state = 
> pg_atomic_read_u64(&desc->state);
> +
> +                             if (buf_state & BM_VALID)
> +                             {
> +                                     operation->nblocks_done += 1;
> +                                     Assert(operation->nblocks_done <= 
> operation->nblocks);
> +
> +                                     CountBufferHit(operation->strategy,
> +                                                                
> operation->rel, operation->persistence,
> +                                                                
> operation->smgr, operation->forknum,
> +                                                                
> operation->blocknum + operation->nblocks_done - 1);

Probably worth including a comment explaining why we count this as a hit. IIRC
earlier versions had such a comment.


> +/*
> + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
> + * avoid an external function call.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
> +                                                     Buffer buffer)

Hm, seems the test in 0002 should be extended to cover the the temp table case.



> +{
> +     BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
> +     uint64          buf_state = pg_atomic_read_u64(&desc->state);
> +
> +     /* Already valid, no work to do */
> +     if (buf_state & BM_VALID)
> +     {
> +             pgaio_wref_clear(&operation->io_wref);
> +             return READ_BUFFER_ALREADY_DONE;
> +     }

Is this reachable for local buffers?


> +     pgaio_submit_staged();
> +
> +     if (pgaio_wref_valid(&desc->io_wref))
> +     {
> +             operation->io_wref = desc->io_wref;
> +             operation->foreign_io = true;
> +             return READ_BUFFER_IN_PROGRESS;
> +     }
> +
> +     /*
> +      * While it is possible for a buffer to have been prepared for IO but 
> not
> +      * yet had its wait reference set, there's no way for us to know that 
> for
> +      * temporary buffers. Thus, we'll prepare for own IO on this buffer.
> +      */
> +     return READ_BUFFER_READY_FOR_IO;

Is that actually possible? And would it be ok to just do start IO in that
case?


> +/*
> + * Try to start IO on the first buffer in a new run of blocks. If AIO is in
> + * progress, be it in this backend or another backend, we just associate the
> + * wait reference with the operation and wait in WaitReadBuffers(). This 
> turns
> + * out to be important for performance in two workloads:
> + *
> + * 1) A read stream that has to read the same block multiple times within the
> + *    readahead distance. This can happen e.g. for the table accesses of an
> + *    index scan.
> + *
> + * 2) Concurrent scans by multiple backends on the same relation.
> + *
> + * If we were to synchronously wait for the in-progress IO, we'd not be able
> + * to keep enough I/O in flight.
> + *
> + * If we do find there is ongoing I/O for the buffer, we set up a 1-block
> + * ReadBuffersOperation that WaitReadBuffers then can wait on.
> + *
> + * It's possible that another backend has started IO on the buffer but not 
> yet
> + * set its wait reference. In this case, we have no choice but to wait for
> + * either the wait reference to be valid or the IO to be done.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewReadBufferIO(ReadBuffersOperation *operation,
> +                                        Buffer buffer)
> +{

I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
"Continue"? Or "First" & "Additional"?  Or ...


> +     uint64          buf_state;
> +     BufferDesc *desc;
> +
> +     if (BufferIsLocal(buffer))
> +             return PrepareNewLocalReadBufferIO(operation, buffer);
> +
> +     ResourceOwnerEnlarge(CurrentResourceOwner);
> +     desc = GetBufferDescriptor(buffer - 1);
> +
> +     for (;;)
> +     {
> +             buf_state = LockBufHdr(desc);

Perhaps worth adding an
   Assert(buf_state & BM_TAG_VALID)?


> +             /* Already valid, no work to do */
> +             if (buf_state & BM_VALID)
> +             {
> +                     UnlockBufHdr(desc);
> +                     pgaio_wref_clear(&operation->io_wref);

Perhaps we should clear &operation->io_wref once at the start? Because right
now it'll be cleared if BM_VALID and it'll be set if BM_IO_IN_PROGRESS &&
&desc->io_wref, but it won't be touched when in BM_IO_IN_PROGRESS without a
wref set.  It seems either we should just touch &operation->io_wref if
  BM_IO_IN_PROGRESS && pgaio_wref_valid(&desc->io_wref)
or we should reliably do it.



> +                     return READ_BUFFER_ALREADY_DONE;
> +             }
> +
> +             if (buf_state & BM_IO_IN_PROGRESS)
> +             {
> +                     /* Join existing read */
> +                     if (pgaio_wref_valid(&desc->io_wref))
> +                     {
> +                             operation->io_wref = desc->io_wref;
> +                             operation->foreign_io = true;
> +                             UnlockBufHdr(desc);
> +                             return READ_BUFFER_IN_PROGRESS;
> +                     }
> +
> +                     /*
> +                      * If the wait ref is not valid but the IO is in 
> progress, someone
> +                      * else started IO but hasn't set the wait ref yet. We 
> have no
> +                      * choice but to wait until the IO completes.
> +                      */
> +                     UnlockBufHdr(desc);
> +                     pgaio_submit_staged();
> +                     WaitIO(desc);
> +                     continue;

Before this commit there was an explanation for this submit:

-    /*
-     * If this backend currently has staged IO, we need to submit the pending
-     * IO before waiting for the right to issue IO, to avoid the potential for
-     * deadlocks (and, more commonly, unnecessary delays for other backends).
-     */

Seems that vanished?



> +/*
> + * When building a new IO from multiple buffers, we won't include buffers
> + * that are already valid or already in progress. This function should only 
> be
> + * used for additional adjacent buffers following the head buffer in a new 
> IO.
> + *
> + * Returns true if the buffer was successfully prepared for IO and false if 
> it
> + * is rejected and the read IO should not include this buffer.
> + */
> +static bool
> +PrepareAdditionalReadBuffer(Buffer buffer)

I think it'd be good to mention that this may never wait for IO or such to
avoid deadlocks.



> +     /* Check if we can start IO on the first to-be-read buffer */
> +     if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) 
> <
> +             READ_BUFFER_READY_FOR_IO)
> +     {

I don't love this < bit. For one there's no mention in
PrepareReadBuffer_Status mentioning that the numerical order is important. Any
reason to not just test != READ_BUFFER_READY_FOR_IO?

The assignment inside the if also looks somewhat awkward. For while() loops
there's often not really a better way to write it, but here you could just as
well do the status assignment in a line before.


> +             pgaio_io_release(ioh);
> +             *nblocks_progress = 1;
> +             if (status == READ_BUFFER_ALREADY_DONE)
> +             {
> +                     /*
> +                      * Someone else has already completed this block, we're 
> done.
> +                      *
> +                      * When IO is necessary, ->nblocks_done is updated in
> +                      * ProcessReadBuffersResult(), but that is not called 
> if no IO is
> +                      * necessary. Thus update here.
> +                      */
> +                     operation->nblocks_done += 1;
> +                     Assert(operation->nblocks_done <= operation->nblocks);
> +
> +                     /*
> +                      * Report and track this as a 'hit' for this backend, 
> even though
> +                      * it must have started out as a miss in 
> PinBufferForBlock(). The
> +                      * other backend will track this as a 'read'.
> +                      */
> +                     CountBufferHit(operation->strategy,
> +                                                operation->rel, 
> operation->persistence,
> +                                                operation->smgr, 
> operation->forknum,
> +                                                operation->blocknum + 
> operation->nblocks_done - 1);
> +                     return false;
> +             }
> +
> +             /* The IO is already in-progress */
> +             Assert(status == READ_BUFFER_IN_PROGRESS);
> +             CheckReadBuffersOperation(operation, false);

I was about to suggest that there should be a CheckReadBuffersOperation() for
both returns here, but there already are CheckReadBuffersOperation after calls
to AsyncReadBuffers(), so I think this CheckReadBuffersOperation could just be
removed.



>       /*
> -      * Check if we can start IO on the first to-be-read buffer.
> -      *
> -      * If an I/O is already in progress in another backend, we want to wait
> -      * for the outcome: either done, or something went wrong and we will
> -      * retry.
> +      * How many neighboring-on-disk blocks can we scatter-read into other
> +      * buffers at the same time?  In this case we don't wait if we see an 
> I/O
> +      * already in progress.  We already set BM_IO_IN_PROGRESS for the head
> +      * block, so we should get on with that I/O as soon as possible.
>        */
> -     if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
> +     for (int i = nblocks_done + 1; i < operation->nblocks; i++)
>       {
> -             /*
> -              * Someone else has already completed this block, we're done.
> -              *
> -              * When IO is necessary, ->nblocks_done is updated in
> -              * ProcessReadBuffersResult(), but that is not called if no IO 
> is
> -              * necessary. Thus update here.
> -              */
> -             operation->nblocks_done += 1;
> -             *nblocks_progress = 1;
> -
> -             pgaio_io_release(ioh);
> -             pgaio_wref_clear(&operation->io_wref);
> -             did_start_io = false;
> +             if (!PrepareAdditionalReadBuffer(buffers[i]))
> +                     break;
> +             /* Must be consecutive block numbers. */
> +             Assert(BufferGetBlockNumber(buffers[i - 1]) ==
> +                        BufferGetBlockNumber(buffers[i]) - 1);

Seems this assert could stand to be before the PrepareAdditionalReadBuffer(),
as it better hold true before we try to BM_IO_IN_PROGRESS?

I realize this is old, but since you're whacking this around...



> diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
> index 4017896f951..f85a9acc6ac 100644
> --- a/src/include/storage/bufmgr.h
> +++ b/src/include/storage/bufmgr.h
> @@ -147,6 +147,8 @@ struct ReadBuffersOperation
>       int                     flags;
>       int16           nblocks;
>       int16           nblocks_done;
> +     /* true if waiting on another backend's IO */
> +     bool            foreign_io;
>       PgAioWaitRef io_wref;
>       PgAioReturn io_return;
>  };

This adds an alignment-padding hole between nblocks_done and io_wref.  Read
stream can allocate quite a few of these, so it's probably worth avoiding?

There's a padding hole between persistence and forknum, but that seems a bit
ugly to utilize. A bit less ugly if we swapped forknum and persistence.

Or we could make 'flags' a uint8/16 (flags should imo always be unsigned, and
there are just four flag bits right now).

But perhaps it's also not worth worrying about right now.


[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Attributes.html#index-pure

Greetings,

Andres Freund

Re: Don't synchronously wait for already-in-progress IO in read stream

Reply via email to