Hi, > Attached v5 adds some comments to the tests, fixes a few nits in the > actual code, and adds a commit to fix what I think is an existing > off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.
> Subject: [PATCH v5 3/5] Fix off-by-one error in read IO tracing > > --- > src/backend/storage/buffer/bufmgr.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 00bc609529a..0723d4f3dd8 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int > *nblocks_progress) > * must have started out as a miss in PinBufferForBlock(). The > other > * backend will track this as a 'read'. > */ > - TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + > operation->nblocks_done, > + TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + > operation->nblocks_done - 1, > > operation->smgr->smgr_rlocator.locator.spcOid, > > operation->smgr->smgr_rlocator.locator.dbOid, > > operation->smgr->smgr_rlocator.locator.relNumber, > -- Ah, the issue is that we already incremented nblocks_done, right? Maybe it'd be easier to understand if we stashed blocknum + nblocks_done into a local var, and use it in in both branches of if (!ReadBuffersCanStartIO())? This probably needs to be backpatched... > Subject: [PATCH v5 4/5] Make buffer hit helper > > Already two places count buffer hits, requiring quite a few lines of > code since we do accounting in so many places. Future commits will add > more locations, so refactor into a helper. > > Reviewed-by: Nazir Bilal Yavuz <[email protected]> > Discussion: > https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv > --- > src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++-------------- > 1 file changed, 56 insertions(+), 55 deletions(-) > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 0723d4f3dd8..399004c2e44 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr, > bool > *foundPtr, IOContext io_context); > static bool AsyncReadBuffers(ReadBuffersOperation *operation, int > *nblocks_progress); > static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool > is_complete); > + > +static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy > strategy, > + > Relation rel, char persistence, SMgrRelation smgr, > + > ForkNumber forknum, BlockNumber blocknum); > static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext > io_context); > static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln, > IOObject > io_object, IOContext io_context); > @@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel, > bool *foundPtr) > { > BufferDesc *bufHdr; > - IOContext io_context; > - IOObject io_object; > > Assert(blockNum != P_NEW); > > @@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel, > persistence == RELPERSISTENCE_PERMANENT || > persistence == RELPERSISTENCE_UNLOGGED)); > > - if (persistence == RELPERSISTENCE_TEMP) > - { > - io_context = IOCONTEXT_NORMAL; > - io_object = IOOBJECT_TEMP_RELATION; > - } > - else > - { > - io_context = IOContextForStrategy(strategy); > - io_object = IOOBJECT_RELATION; > - } > - I'm mildly worried that this will lead to a bit worse code generation, the compiler might have a harder time figuring out that io_context/io_object doesn't change across multiple PinBufferForBlock calls. Although it already might be unable to do so (we don't mark IOContextForStrategy as pure [1]). I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite direction, and explicitly look up IOContextForStrategy(strategy) *before* the actual_nblocks loop to make sure the compiler doesn't inject external function calls (which will in all likelihood require register spilling etc). I don't think that necessarily has to conflict with the goal of this patch - most of the the deduplicated stuff isn't io_context, so the helper will be beneficial even if have to pull out the io_context/io_object determination to the callsites. > TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum, > > smgr->smgr_rlocator.locator.spcOid, > > smgr->smgr_rlocator.locator.dbOid, > @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel, > > smgr->smgr_rlocator.backend); > > if (persistence == RELPERSISTENCE_TEMP) > - { > bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr); > - if (*foundPtr) > - pgBufferUsage.local_blks_hit++; > - } > else > - { > bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum, > - strategy, foundPtr, > io_context); > - if (*foundPtr) > - pgBufferUsage.shared_blks_hit++; > - } > + strategy, foundPtr, > IOContextForStrategy(strategy)); > + > if (rel) > { > /* And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP branch in CountBufferHit(), I suspect the compiler may not be able to optimize it away. At the very least I'd invert the call to CountBufferHit() and the pgstat_count_buffer_read(), as the latter will probably prevent most optimizations (due to the compiler not being able to prove that (rel)->pgstat_info->counts.blocks_fetched is a different memory location as *foundPtr). > +/* > + * We track various stats related to buffer hits. Because this is done in a > + * few separate places, this helper exists for convenience. > + */ > +static pg_attribute_always_inline void > +CountBufferHit(BufferAccessStrategy strategy, > + Relation rel, char persistence, SMgrRelation smgr, > + ForkNumber forknum, BlockNumber blocknum) > +{ > + IOContext io_context; > + IOObject io_object; > + > + if (persistence == RELPERSISTENCE_TEMP) > + { > + io_context = IOCONTEXT_NORMAL; > + io_object = IOOBJECT_TEMP_RELATION; > + } > + else > + { > + io_context = IOContextForStrategy(strategy); > + io_object = IOOBJECT_RELATION; > + } > + > + TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, > + > blocknum, > + > smgr->smgr_rlocator.locator.spcOid, > + > smgr->smgr_rlocator.locator.dbOid, > + > smgr->smgr_rlocator.locator.relNumber, > + > smgr->smgr_rlocator.backend, > + true); > + > + if (persistence == RELPERSISTENCE_TEMP) > + pgBufferUsage.local_blks_hit += 1; > + else > + pgBufferUsage.shared_blks_hit += 1; > + > + if (rel) > + pgstat_count_buffer_hit(rel); > + > + pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0); > + > + if (VacuumCostActive) > + VacuumCostBalance += VacuumCostPageHit; > +} I don't think "Count*" is a great name for something that does tracepoints and vacuum cost balance accounting, the latter actually changes behavior of the program due to the sleeps it injects. The first alternative I have is AccountForBufferHit(), not great, but still seems a bit better. > From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <[email protected]> > Date: Fri, 23 Jan 2026 14:00:31 -0500 > Subject: [PATCH v5 5/5] Don't wait for already in-progress IO > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > When a backend attempts to start a read on a buffer and finds that I/O > is already in progress, it previously waited for that I/O to complete > before initiating reads for any other buffers. Although the backend must > still wait for the I/O to finish when later acquiring the buffer, it > should not need to wait at read start time. Other buffers may be > available for I/O, and in some workloads this waiting significantly > reduces concurrency. > > For example, index scans may repeatedly request the same heap block. If > the backend waits each time it encounters an in-progress read, the > access pattern effectively degenerates into synchronous I/O. By > introducing the concept of foreign I/O operations, a backend can record > the buffer’s wait reference and defer waiting until WaitReadBuffers() > when it actually acquires the buffer. > > In rare cases, a backend may still need to wait when starting a read if > it encounters a buffer after another backend has set BM_IO_IN_PROGRESS > but before the buffer descriptor’s wait reference has been set. Such > windows should be brief and uncommon. > > Author: Melanie Plageman <[email protected]> > Reviewed-by: Andres Freund <[email protected]> > Reviewed-by: Nazir Bilal Yavuz <[email protected]> > Discussion: > https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv > +/* > + * In AsyncReadBuffers(), when preparing a buffer for reading and setting > + * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may > + * already contain the desired block. AsyncReadBuffers() must distinguish > + * between these cases (and the case where it should initiate I/O) so it can > + * mark an in-progress buffer as foreign I/O rather than waiting on it. > + */ > +typedef enum PrepareReadBuffer_Status > +{ > + READ_BUFFER_ALREADY_DONE, > + READ_BUFFER_IN_PROGRESS, > + READ_BUFFER_READY_FOR_IO, > +} PrepareReadBuffer_Status; I don't personally like mixing underscore and camel case naming within one name. I wonder if might be worth splitting this up in a refactoring and a "behavioural change" commit. Might be too complicated. Candidates for a split seem to be: - Moving pgaio_io_acquire_nb() to earlier - Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support for READ_BUFFER_IN_PROGRESS - introduce READ_BUFFER_IN_PROGRESS > /* > * We track various stats related to buffer hits. Because this is done in a > * few separate places, this helper exists for convenience. > @@ -1815,8 +1791,11 @@ WaitReadBuffers(ReadBuffersOperation *operation) > * b) reports some time as waiting, even if we never > waited > * > * we first check if we already know the IO is complete. > + * > + * Note that operation->io_return is uninitialized for > foreign IO, > + * so we cannot count that wait time. > */ I'm confused - your comment says we can't count wait time with a foreign IO, but then oes on to count foreign IO time? The lack of io_return just means we can't do the cheaper pre-check for PGAIO_RS_UNKNOWN, no? > - if (aio_ret->result.status == PGAIO_RS_UNKNOWN && > + if ((operation->foreign_io || aio_ret->result.status == > PGAIO_RS_UNKNOWN) && > !pgaio_wref_check_done(&operation->io_wref)) > { > instr_time io_start = > pgstat_prepare_io_time(track_io_timing); > @@ -1835,11 +1814,33 @@ WaitReadBuffers(ReadBuffersOperation *operation) > > Assert(pgaio_wref_check_done(&operation->io_wref)); > } > > - /* > - * We now are sure the IO completed. Check the results. > This > - * includes reporting on errors if there were any. > - */ > - ProcessReadBuffersResult(operation); > + if (unlikely(operation->foreign_io)) > + { > + Buffer buffer = > operation->buffers[operation->nblocks_done]; > + BufferDesc *desc = BufferIsLocal(buffer) ? > + GetLocalBufferDescriptor(-buffer - 1) : > + GetBufferDescriptor(buffer - 1); > + uint32 buf_state = > pg_atomic_read_u64(&desc->state); > + > + if (buf_state & BM_VALID) > + { > + operation->nblocks_done += 1; > + Assert(operation->nblocks_done <= > operation->nblocks); > + > + CountBufferHit(operation->strategy, > + > operation->rel, operation->persistence, > + > operation->smgr, operation->forknum, > + > operation->blocknum + operation->nblocks_done - 1); Probably worth including a comment explaining why we count this as a hit. IIRC earlier versions had such a comment. > +/* > + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to > + * avoid an external function call. > + */ > +static PrepareReadBuffer_Status > +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation, > + Buffer buffer) Hm, seems the test in 0002 should be extended to cover the the temp table case. > +{ > + BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1); > + uint64 buf_state = pg_atomic_read_u64(&desc->state); > + > + /* Already valid, no work to do */ > + if (buf_state & BM_VALID) > + { > + pgaio_wref_clear(&operation->io_wref); > + return READ_BUFFER_ALREADY_DONE; > + } Is this reachable for local buffers? > + pgaio_submit_staged(); > + > + if (pgaio_wref_valid(&desc->io_wref)) > + { > + operation->io_wref = desc->io_wref; > + operation->foreign_io = true; > + return READ_BUFFER_IN_PROGRESS; > + } > + > + /* > + * While it is possible for a buffer to have been prepared for IO but > not > + * yet had its wait reference set, there's no way for us to know that > for > + * temporary buffers. Thus, we'll prepare for own IO on this buffer. > + */ > + return READ_BUFFER_READY_FOR_IO; Is that actually possible? And would it be ok to just do start IO in that case? > +/* > + * Try to start IO on the first buffer in a new run of blocks. If AIO is in > + * progress, be it in this backend or another backend, we just associate the > + * wait reference with the operation and wait in WaitReadBuffers(). This > turns > + * out to be important for performance in two workloads: > + * > + * 1) A read stream that has to read the same block multiple times within the > + * readahead distance. This can happen e.g. for the table accesses of an > + * index scan. > + * > + * 2) Concurrent scans by multiple backends on the same relation. > + * > + * If we were to synchronously wait for the in-progress IO, we'd not be able > + * to keep enough I/O in flight. > + * > + * If we do find there is ongoing I/O for the buffer, we set up a 1-block > + * ReadBuffersOperation that WaitReadBuffers then can wait on. > + * > + * It's possible that another backend has started IO on the buffer but not > yet > + * set its wait reference. In this case, we have no choice but to wait for > + * either the wait reference to be valid or the IO to be done. > + */ > +static PrepareReadBuffer_Status > +PrepareNewReadBufferIO(ReadBuffersOperation *operation, > + Buffer buffer) > +{ I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" & "Continue"? Or "First" & "Additional"? Or ... > + uint64 buf_state; > + BufferDesc *desc; > + > + if (BufferIsLocal(buffer)) > + return PrepareNewLocalReadBufferIO(operation, buffer); > + > + ResourceOwnerEnlarge(CurrentResourceOwner); > + desc = GetBufferDescriptor(buffer - 1); > + > + for (;;) > + { > + buf_state = LockBufHdr(desc); Perhaps worth adding an Assert(buf_state & BM_TAG_VALID)? > + /* Already valid, no work to do */ > + if (buf_state & BM_VALID) > + { > + UnlockBufHdr(desc); > + pgaio_wref_clear(&operation->io_wref); Perhaps we should clear &operation->io_wref once at the start? Because right now it'll be cleared if BM_VALID and it'll be set if BM_IO_IN_PROGRESS && &desc->io_wref, but it won't be touched when in BM_IO_IN_PROGRESS without a wref set. It seems either we should just touch &operation->io_wref if BM_IO_IN_PROGRESS && pgaio_wref_valid(&desc->io_wref) or we should reliably do it. > + return READ_BUFFER_ALREADY_DONE; > + } > + > + if (buf_state & BM_IO_IN_PROGRESS) > + { > + /* Join existing read */ > + if (pgaio_wref_valid(&desc->io_wref)) > + { > + operation->io_wref = desc->io_wref; > + operation->foreign_io = true; > + UnlockBufHdr(desc); > + return READ_BUFFER_IN_PROGRESS; > + } > + > + /* > + * If the wait ref is not valid but the IO is in > progress, someone > + * else started IO but hasn't set the wait ref yet. We > have no > + * choice but to wait until the IO completes. > + */ > + UnlockBufHdr(desc); > + pgaio_submit_staged(); > + WaitIO(desc); > + continue; Before this commit there was an explanation for this submit: - /* - * If this backend currently has staged IO, we need to submit the pending - * IO before waiting for the right to issue IO, to avoid the potential for - * deadlocks (and, more commonly, unnecessary delays for other backends). - */ Seems that vanished? > +/* > + * When building a new IO from multiple buffers, we won't include buffers > + * that are already valid or already in progress. This function should only > be > + * used for additional adjacent buffers following the head buffer in a new > IO. > + * > + * Returns true if the buffer was successfully prepared for IO and false if > it > + * is rejected and the read IO should not include this buffer. > + */ > +static bool > +PrepareAdditionalReadBuffer(Buffer buffer) I think it'd be good to mention that this may never wait for IO or such to avoid deadlocks. > + /* Check if we can start IO on the first to-be-read buffer */ > + if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) > < > + READ_BUFFER_READY_FOR_IO) > + { I don't love this < bit. For one there's no mention in PrepareReadBuffer_Status mentioning that the numerical order is important. Any reason to not just test != READ_BUFFER_READY_FOR_IO? The assignment inside the if also looks somewhat awkward. For while() loops there's often not really a better way to write it, but here you could just as well do the status assignment in a line before. > + pgaio_io_release(ioh); > + *nblocks_progress = 1; > + if (status == READ_BUFFER_ALREADY_DONE) > + { > + /* > + * Someone else has already completed this block, we're > done. > + * > + * When IO is necessary, ->nblocks_done is updated in > + * ProcessReadBuffersResult(), but that is not called > if no IO is > + * necessary. Thus update here. > + */ > + operation->nblocks_done += 1; > + Assert(operation->nblocks_done <= operation->nblocks); > + > + /* > + * Report and track this as a 'hit' for this backend, > even though > + * it must have started out as a miss in > PinBufferForBlock(). The > + * other backend will track this as a 'read'. > + */ > + CountBufferHit(operation->strategy, > + operation->rel, > operation->persistence, > + operation->smgr, > operation->forknum, > + operation->blocknum + > operation->nblocks_done - 1); > + return false; > + } > + > + /* The IO is already in-progress */ > + Assert(status == READ_BUFFER_IN_PROGRESS); > + CheckReadBuffersOperation(operation, false); I was about to suggest that there should be a CheckReadBuffersOperation() for both returns here, but there already are CheckReadBuffersOperation after calls to AsyncReadBuffers(), so I think this CheckReadBuffersOperation could just be removed. > /* > - * Check if we can start IO on the first to-be-read buffer. > - * > - * If an I/O is already in progress in another backend, we want to wait > - * for the outcome: either done, or something went wrong and we will > - * retry. > + * How many neighboring-on-disk blocks can we scatter-read into other > + * buffers at the same time? In this case we don't wait if we see an > I/O > + * already in progress. We already set BM_IO_IN_PROGRESS for the head > + * block, so we should get on with that I/O as soon as possible. > */ > - if (!ReadBuffersCanStartIO(buffers[nblocks_done], false)) > + for (int i = nblocks_done + 1; i < operation->nblocks; i++) > { > - /* > - * Someone else has already completed this block, we're done. > - * > - * When IO is necessary, ->nblocks_done is updated in > - * ProcessReadBuffersResult(), but that is not called if no IO > is > - * necessary. Thus update here. > - */ > - operation->nblocks_done += 1; > - *nblocks_progress = 1; > - > - pgaio_io_release(ioh); > - pgaio_wref_clear(&operation->io_wref); > - did_start_io = false; > + if (!PrepareAdditionalReadBuffer(buffers[i])) > + break; > + /* Must be consecutive block numbers. */ > + Assert(BufferGetBlockNumber(buffers[i - 1]) == > + BufferGetBlockNumber(buffers[i]) - 1); Seems this assert could stand to be before the PrepareAdditionalReadBuffer(), as it better hold true before we try to BM_IO_IN_PROGRESS? I realize this is old, but since you're whacking this around... > diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h > index 4017896f951..f85a9acc6ac 100644 > --- a/src/include/storage/bufmgr.h > +++ b/src/include/storage/bufmgr.h > @@ -147,6 +147,8 @@ struct ReadBuffersOperation > int flags; > int16 nblocks; > int16 nblocks_done; > + /* true if waiting on another backend's IO */ > + bool foreign_io; > PgAioWaitRef io_wref; > PgAioReturn io_return; > }; This adds an alignment-padding hole between nblocks_done and io_wref. Read stream can allocate quite a few of these, so it's probably worth avoiding? There's a padding hole between persistence and forknum, but that seems a bit ugly to utilize. A bit less ugly if we swapped forknum and persistence. Or we could make 'flags' a uint8/16 (flags should imo always be unsigned, and there are just four flag bits right now). But perhaps it's also not worth worrying about right now. [1] https://gcc.gnu.org/onlinedocs/gcc/Common-Attributes.html#index-pure Greetings, Andres Freund
