Hi,
On 2026-02-24 13:13:25 -0500, Peter Geoghegan wrote:
> From bf097582e730c0f0e7c048139d5f30c2971af94c Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sun, 18 Jan 2026 11:18:11 -0500
> Subject: [PATCH v11 01/12] Extract fake LSN infrastructure from GiST index AM.
>
> Extract utility functions used by GiST to generate fake LSNs so that
> other index AMs can reuse this infrastructure to generate fake LSNs.
>
> Preparation for an upcoming commit that will change the rules around
> holding on to buffer pins on leaf pages in unlogged nbtree indexes
> (actually, in all cases barring scans that use a non-MVCC snapshot).
> This is the patch that will add the new amgetbatch interface. Another
> preparatory commit will add fake LSN support to nbtree ahead of the
> amgetbatch commit.
>
> Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming
> XLOG_ASSIGN_LSN.
> +/*
> + * XLogGetFakeLSN - get a fake LSN for an index page that isn't WAL-logged.
> + *
> + * Some index AMs (nbtree, hash, GiST) use LSNs to detect concurrent page
> + * modifications, but not all index pages are WAL-logged. This function
> + * provides a sequence of fake LSNs for that purpose.
> + *
> + * The behavior depends on the relation's persistence:
> + *
> + * - For temporary relations, we use a simple backend-local counter since
> + * temporary relations are only accessible within our session.
> + *
> + * - For permanent relations when WAL-logging is disabled (e.g., during index
> + *   creation with wal_level=minimal), we use the current WAL insert position.
> + *   If the insert position hasn't advanced since the last call, we emit a
> + *   dummy WAL record via XLogAssignLSN() to ensure we get a distinct LSN.
> + *
> + * - For unlogged relations, we use the global fake LSN counter maintained
> + * by GetFakeLSNForUnloggedRel().
> + */
> +XLogRecPtr
> +XLogGetFakeLSN(Relation rel)
> +{
> + if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
> + {
> +		/*
> +		 * Temporary relations are only accessible in our session, so a
> +		 * simple backend-local counter will do.
> +		 */
> + static XLogRecPtr counter = FirstNormalUnloggedLSN;
> +
> + return counter++;
> + }
> + else if (RelationIsPermanent(rel))
> + {
> +		/*
> +		 * WAL-logging on this relation will start after commit, so its
> +		 * LSNs must be distinct numbers smaller than the LSN at the next
> +		 * commit.  Emit a dummy WAL record if insert-LSN hasn't advanced
> +		 * after the last call.
> +		 */
> + static XLogRecPtr lastlsn = InvalidXLogRecPtr;
> + XLogRecPtr currlsn = GetXLogInsertRecPtr();
> +
> + /* Shouldn't be called for WAL-logging relations */
> + Assert(!RelationNeedsWAL(rel));
> +
> +		/* No need for an actual record if we already have a distinct LSN */
> + if (XLogRecPtrIsValid(lastlsn) && lastlsn == currlsn)
> + currlsn = XLogAssignLSN();
> +
> + lastlsn = currlsn;
> + return currlsn;
> + }
I know you just moved this code, but isn't it kinda bogus to use if
(RelationIsPermanent()) here? If the fake LSN is used on a durable page, it
wouldn't be safe to just use XLogAssignLSN(), as it'd not emit an FPI for the
page.
All the gist callers use if (!RelationNeedsWAL()) to protect calls to this, so
it's not a bug, but still seems somewhat odd.
I guess there is the "Assert(!RelationNeedsWAL(rel));", but at least the
comments of the function should make clear that this should only be called if
!RelationNeedsWAL().
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index d1babaff0..a7a569c99 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4470,13 +4470,13 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
> * lost after a crash anyway. Most unlogged relation pages do not bear
> * LSNs since we never emit WAL records for them, and therefore flushing
> * up through the buffer LSN would be useless, but harmless. However,
> - * GiST indexes use LSNs internally to track page-splits, and therefore
> - * unlogged GiST pages bear "fake" LSNs generated by
> - * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
> - * LSN counter could advance past the WAL insertion point; and if it did
> -	 * happen, attempting to flush WAL through that location would fail, with
> - * disastrous system-wide consequences. To make sure that can't happen,
> - * skip the flush if the buffer isn't permanent.
> +	 * some index AMs (nbtree, hash, GiST) use LSNs internally to detect
> +	 * concurrent page modifications, and therefore unlogged index pages bear
> + * "fake" LSNs generated by XLogGetFakeLSN. It is unlikely but possible
> + * that the fake LSN counter could advance past the WAL insertion point;
> + * and if it did happen, attempting to flush WAL through that location
> + * would fail, with disastrous system-wide consequences. To make sure
> + * that can't happen, skip the flush if the buffer isn't permanent.
> */
> if (buf_state & BM_PERMANENT)
> XLogFlush(recptr);
I'd probably not list the AMs here, because as of this commit, it'd not be
true yet.
Other than these points, I think it's probably worth pushing this soon?
BufferGetLSNAtomic()'s comment about not needing a lock if hint bits aren't
used looks a bit bogus to me with this use-case - it makes it sound like we
don't rely on the correctness of the return value if hint bits aren't
enabled. But that's bogus, we do. The relevant difference afaict is that
without WAL logging of hint bits we will never set the LSN of a page that is
read locked, and therefore any caller that has at least a read lock doesn't
need BufferGetLSNAtomic() unless XLogHintBitIsNeeded().
> From 64fae29f3b98219a8f32246b2adf103ee264f4d5 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sun, 18 Jan 2026 11:14:36 -0500
> Subject: [PATCH v11 02/12] Use fake LSNs to improve nbtree dropPin behavior.
>
> Previously unlogged nbtree indexes needed to hold on to a leaf page
> buffer pin when stopped on that leaf page, purely so that the
> _bt_killitems process had a way to be sure that there wasn't any unsafe
> concurrent TID recycling by VACUUM. The _bt_killitems' dropPin strategy
> couldn't be used before now, since it works by checking if the page LSN
> has changed in the period after _bt_readpage read the page's items, but
> before _bt_killitems was called. We now use the same LSN trick with
> unlogged indexes, bringing the same benefits to these scans that commit
> 2ed5b87f brought to scans of logged relations.
>
> This is preparation for an upcoming commit that will add the amgetbatch
> interface and switch nbtree over to it (from amgettuple). That will go
> further by completely obviating the need for amgetbatch scans to hang on
> to buffer pins (barring scans involving a non-MVCC snapshot).
>
> Author: Peter Geoghegan <[email protected]>
> Discussion: https://postgr.es/m/cah2-wzkehuhxyua8quc7rrn3etnxpiksjpfo8mhb+0dr2k0...@mail.gmail.com
> ---
> src/backend/access/nbtree/README | 5 +-
> src/backend/access/nbtree/nbtdedup.c | 8 ++-
> src/backend/access/nbtree/nbtinsert.c | 48 +++++++++-------
> src/backend/access/nbtree/nbtpage.c | 82 +++++++++++++++------------
> src/backend/access/nbtree/nbtree.c | 8 ---
> src/backend/access/nbtree/nbtsearch.c | 1 -
> src/backend/access/nbtree/nbtutils.c | 1 -
> 7 files changed, 80 insertions(+), 73 deletions(-)
>
> diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
> index 53d4a61dc..cb921ca2e 100644
> --- a/src/backend/access/nbtree/README
> +++ b/src/backend/access/nbtree/README
> @@ -485,9 +485,8 @@ We handle this kill_prior_tuple race condition by having affected index
> scans conservatively assume that any change to the leaf page at all
> implies that it was reached by btbulkdelete in the interim period when no
> buffer pin was held. This is implemented by not setting any LP_DEAD bits
> -on the leaf page at all when the page's LSN has changed. (That won't work
> -with an unlogged index, so for now we don't ever apply the "don't hold
> -onto pin" optimization there.)
> +on the leaf page at all when the page's LSN has changed. (This is why we
> +implement "fake" LSNs for unlogged index relations.)
>
Does the reference to btbulkdelete actually cover all the relevant cases?
Seems like e.g. a concurrent _bt_delete_or_dedup_one_page() would also be a
problem?
One thing that's somewhat sad about this logic is that the
latestlsn = BufferGetLSNAtomic(buf);
Assert(so->currPos.lsn <= latestlsn);
if (so->currPos.lsn != latestlsn)
check will trigger even if all the changes were done due to hint bit
FPIs. E.g. due to another backend hinting other index entries on the same page
as dirty... But anyway, that's not the "fault" of your change.
(sorry, that's just me trying to understand the code to be able to review)
> @@ -2720,25 +2730,25 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
> xlinfo = XLOG_BTREE_UNLINK_PAGE;
>
> recptr = XLogInsert(RM_BTREE_ID, xlinfo);
> + }
> + else
> + recptr = XLogGetFakeLSN(rel);
>
> - if (BufferIsValid(metabuf))
> - {
> - PageSetLSN(metapg, recptr);
> - }
> - page = BufferGetPage(rbuf);
> + if (BufferIsValid(metabuf))
> + PageSetLSN(metapg, recptr);
> + page = BufferGetPage(rbuf);
> + PageSetLSN(page, recptr);
> + page = BufferGetPage(buf);
> + PageSetLSN(page, recptr);
> + if (BufferIsValid(lbuf))
> + {
> + page = BufferGetPage(lbuf);
> PageSetLSN(page, recptr);
> - page = BufferGetPage(buf);
> + }
> + if (target != leafblkno)
> + {
> + page = BufferGetPage(leafbuf);
> PageSetLSN(page, recptr);
> - if (BufferIsValid(lbuf))
> - {
> - page = BufferGetPage(lbuf);
> - PageSetLSN(page, recptr);
> - }
> - if (target != leafblkno)
> - {
> - page = BufferGetPage(leafbuf);
> - PageSetLSN(page, recptr);
> - }
> }
>
> END_CRIT_SECTION();
This hunk ends up a bit more confusing because of removing the curlies around
PageSetLSN(metapg, recptr), but I also see why you removed them...
> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index 32ae0bda8..0ced7b02c 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -67,7 +67,6 @@ _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
> * Have to set so->currPos.lsn so that _bt_killitems has a way to detect
> * when concurrent heap TID recycling by VACUUM might have taken place.
> */
> - Assert(RelationNeedsWAL(rel));
> so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
> _bt_relbuf(rel, so->currPos.buf);
> so->currPos.buf = InvalidBuffer;
This is somewhat orthogonal and just quick idea:
I wonder if the worry about the overhead of BufferGetLSNAtomic() could be
addressed by adding a ReleaseBuffer() version that returns the current LSN.
Have you done any testing to see if the added BufferGetLSNAtomic() calls
introduce a measurable overhead when using unlogged tables? If so, could the
BufferGetLSNAtomic() call be skipped if the scan doesn't currently have any
killed items?
> From c28cb1f517f458fc8976081bb52ca7310d3fff0a Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Tue, 9 Sep 2025 19:50:03 -0400
> Subject: [PATCH v11 03/12] Add interfaces that enable index prefetching.
>
> Add a new amgetbatch index AM interface that allows index access methods
> to implement plain/ordered index scans that return index entries in
> batches comprising all matching items from an index page, rather than
> one match at a time.
>
> This commit also adds a new table AM interface callback, called by the
> core executor through the new table_index_getnext_slot shim function.
> This allows the table AM to directly manage the progress of index scans
> rather than having individual TIDs passed in by the caller one by one.
> The amgetbatch interface is tightly coupled with the new approach to
> index scans added to the table AM. The table AM can apply knowledge of
> which TIDs will be returned to the scan in the near future to perform
> I/O prefetching. Prefetching will be added by an upcoming commit.
>
> Index access methods that support plain index scans must now implement
> either the amgetbatch interface OR the amgettuple interface. The
> amgettuple interface will still be used by index AMs that require direct
> control over the progress of index scans (e.g., GiST with KNN ordered
> scans). Almost all existing callers that perform index scans now use
> the new table_index_getnext_slot interface, regardless of whether the
> underlying index AM uses amgetbatch or amgettuple.
>
> The amgetbatch interface returns batches that hold a buffer pin on an
> index page that can be used by the table AM as an interlock against
> concurrent TID recycling by VACUUM. Now heapam only needs to hold on to
> such a pin for an instant -- except during scans that use a non-MVCC
> snapshot. Non-MVCC scans continue to need to hold the pin until all of
> the batch's TIDs have been fetched from the heap.
>
> This extends the dropPin mechanism added to nbtree by commit 2ed5b87f,
> and generalizes it to work with all index AMs that support the new
> amgetbatch interface. We can always safely drop index page pins
> eagerly, provided the scan uses an MVCC snapshot (unlike the nbtree
> dropPin optimization, which had a couple of additional restrictions).
>
> An upcoming commit that will add index prefetching will use a read
> stream to read heap pages during index scans. Read stream is careful to
> limit how many things it pins, lest we run into problems due to having
> too many buffers pinned. Simply never holding on to index page buffer
> pins greatly simplifies resource management for index prefetching;
> there's no risk of unintended interactions between the read stream and
> index AM. The only downside is that we cannot support prefetching
> during scans that use a non-MVCC snapshot, which seems quite acceptable.
>
> In practice, heapam doesn't drop each batch's index page buffer pin at
> the earliest opportunity during index-only scans. This was deemed
> necessary to avoid regressing index-only scans with a LIMIT, in
> particular with nestloop anti-joins and nestloop semi-joins; eagerly
> loading all the visibility information up front regressed such queries.
> The new amgetbatch interface gives table AMs the authority to decide
> when and where to drop index page pins, so this can be considered a
> heapam implementation detail (index AMs don't need to know about it).
> This scheme still allows index prefetching to consistently hold no more
> than one batch index page pin at a time, even when an index-only scan
> (that must perform some heap fetches) holds open several index batches
> at once in order to maintain an adequate prefetch distance.
> ...
> 47 files changed, 2413 insertions(+), 1440 deletions(-)
This is a huge change. Is there a chance we can break it up into more
manageable chunks?
A first note: Seems like this needs to update doc/src/sgml/indexam.sgml?
> +/*
> + * Location of a BatchMatchingItem that appears in a IndexScanBatch returned
> + * by (and subsequently passed to) an amgetbatch routine
> + */
> +typedef struct BatchRingItemPos
> +{
> + /* Position references a valid BatchRingBuffer.batches[] entry? */
> + bool valid;
> +
> + /* BatchRingBuffer.batches[]-wise index to relevant IndexScanBatch */
> + uint8 batch;
> +
> + /* IndexScanBatch.items[]-wise index to relevant BatchMatchingItem */
> + int item;
> +} BatchRingItemPos;
Does item actually need to be a 32bit int? This way there's a hole and
presumably IndexScanBatch.items[] has a limited size?
Even if it does, any reason for it to not be unsigned?
> +/*
> + * Matching item returned by amgetbatch (in returned IndexScanBatch) during an
> + * index scan.  Used by table AM to locate relevant matching table tuple.
> + */
> +typedef struct BatchMatchingItem
> +{
> + ItemPointerData heapTid; /* TID of referenced heap item */
If we're generalizing this, I'd be inclined to rename it away from heap to
table.
> + OffsetNumber indexOffset; /* index item's location within page */
> +	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
> +} BatchMatchingItem;
> +/*
> + * Data about one batch of items returned by (and passed to) amgetbatch during
> + * index scans
> + */
> +typedef struct IndexScanBatchData
> +{
> +	/*
> +	 * Information output by amgetbatch index AMs upon returning a batch with
> +	 * one or more matching items, describing details of the index page where
> +	 * matches were located.
> +	 *
> +	 * Used in the next amgetbatch call to determine which index page to read
> +	 * next (or to determine if there's no further matches in current scan
> +	 * direction).
> +	 */
> + BlockNumber currPage; /* Index page with matching items */
> + BlockNumber prevPage; /* currPage's left link */
> + BlockNumber nextPage; /* currPage's right link */
Hm. Are these, and some of the later fields, generic enough for all index
types? IIRC there are index types out there that don't use buffers or blocks
in that sense.
I wonder if individual index AMs need a way to influence what is stored per
batch. You could e.g. imagine something like this:
typedef struct BtreeIndexScanBatchData
{
BlockNumber currPage;
BlockNumber prevPage;
...
IndexScanBatchData common;
} BtreeIndexScanBatchData;
and have the callers to indexam_util_batch_alloc() provide how much space is
needed before the common IndexScanBatchData (or have that be a property in
the IndexAmRoutine). That way the efficient FLEXIBLE_ARRAY_MEMBER addressing
for items could be kept.
> +	Buffer		buf;			/* currPage buf (invalid means unpinned) */
> + XLogRecPtr lsn; /* currPage's LSN */
> +
> + /* scan direction when the index page was read */
> + ScanDirection dir;
> +
> +	/*
> +	 * knownEndLeft and knownEndRight are used by table AMs to track whether
> +	 * there may be matching index entries to the left and right of currPage,
> +	 * respectively.  This helps them to avoid repeatedly calling amgetbatch.
> +	 */
> + bool knownEndLeft;
> + bool knownEndRight;
If I understand correctly, these mean that we'd reached the end of the scan
in one direction? If so, why does it use left/right instead of
forward/backward?
> +	/*
> +	 * Matching items state for this batch.  Output by index AM for table AM.
> +	 *
> +	 * The items array is always ordered in index order (ie, increasing
> +	 * indexoffset).  When scanning backwards it is convenient for index AMs
> +	 * to fill the array back-to-front, so we start at the last slot and fill
> +	 * downwards.  Hence they need a first-valid-entry and a last-valid-entry
> +	 * counter.
> +	 */
> +	int			firstItem;		/* first valid index in items[] */
> +	int			lastItem;		/* last valid index in items[] */
I'd make these unsigned.
Not sure what "indexoffset" really means in a generic abstraction.
> +	/* info about killed items if any (killedItems is NULL if never used) */
> +	int			numKilled;		/* number of currently stored items */
> +	int		   *killedItems;	/* indexes of killed items */
Signedness again.
I'd specify what is being indexed by killedItems.
What size is killedItems?
I wonder if "killed" is really the right term in the generic code, it's
unclear whether it refers to table or index items. And the past tense is a bit
odd too, since nothing may yet have been "killed".
> + /*
> + * If we are doing an index-only scan, this array stores the per-item
> + * visibility flags BATCH_VIS_CHECKED and BATCH_VIS_ALL_VISIBLE
> + */
> + uint8 *visInfo; /* per-item visibility flags, or NULL */
Why just for index only? Seems like it'd be rather useful to batch visibility
checks for plain index scans too? Obviously that doesn't have to be
implemented immediately, but I'd probably not preclude it in generic
infrastructure.
This seems pretty heap specific. I guess you didn't put it in
IndexFetchHeapData because it'd be somewhat complicated to keep per-batch
state in there?
Perhaps my earlier suggestion about having "private" per-batch state for the
indexam is required for table AMs too. I guess we could store an offset to
the "tableam private" and "indexam private" data in each batch.
> +	/*
> +	 * If we are doing an index-only scan, this is the tuple storage
> +	 * workspace for the matching tuples (tuples referenced by items[]).  It
> +	 * is of size BLCKSZ, so it can hold as much as a full page's worth of
> +	 * tuples.  currTuples points into the trailing portion of this
> +	 * allocation, just past items[].  It is NULL for plain index scans.
> +	 */
This was a bit hard to understand before, but in generic code it seems like it
is too specific to the way it was used in nbtree.
> + char *currTuples; /* tuple storage for items[] */
> + BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items */
> +} IndexScanBatchData;
> +
> +typedef struct IndexScanBatchData *IndexScanBatch;
I wish we would just stop doing this stupid pointer hiding game.
> +
> +/*
> + * Maximum number of batches (leaf pages) we can keep in memory. We need a
> + * minimum of two, since we'll only consider releasing one batch when another
> + * is read.
> + */
> +#define INDEX_SCAN_MAX_BATCHES 2
> +#define INDEX_SCAN_CACHE_BATCHES 2
Hm. These limits are increased in a later patch adding prefetching to
heapam. I guess you kept it low until then to avoid some overhead? Perhaps
that's indicative that hardcoding it isn't great?
> diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
> index 119593b7b..8611b03e5 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -433,11 +433,29 @@ typedef struct TableAmRoutine
> */
> void (*index_fetch_end) (struct IndexFetchTableData *data);
>
> + /*
> + * Fetch the next tuple from an index scan into slot, scanning in the
> + * specified direction, and return true if a tuple was found, false
> + * otherwise.
> + *
> +	 * This callback allows the table AM to directly manage the scan process,
> + * including interfacing with the index AM. The caller simply specifies
> + * the direction of the scan; the table AM takes care of retrieving TIDs
> + * from the index, performing visibility checks, and returning tuples in
> + * the slot.
> + */
> +	bool		(*index_getnext_slot) (IndexScanDesc scan,
> +									   ScanDirection direction,
> +									   TupleTableSlot *slot);
Hm, it's somewhat confusing that we have this now, but still have
->index_fetch_tuple() etc.
> diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
> index 63c067d5a..7c1b427fb 100644
> --- a/src/include/nodes/execnodes.h
> +++ b/src/include/nodes/execnodes.h
> @@ -1723,7 +1723,7 @@ typedef struct IndexScanState
> ExprContext *iss_RuntimeContext;
> Relation iss_RelationDesc;
> struct IndexScanDescData *iss_ScanDesc;
> - IndexScanInstrumentation iss_Instrument;
> + IndexScanInstrumentation *iss_Instrument;
> SharedIndexScanInstrumentation *iss_SharedInfo;
Don't really have an opinion about this, but it's not clear to me why this
patch changed this?
If we want to change it, could we do that in a prereq patch?
> @@ -124,23 +139,29 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>
> Assert(TTS_IS_BUFFERTUPLE(slot));
>
> - /* We can skip the buffer-switching logic if we're in mid-HOT chain. */
> - if (!*call_again)
> + /* We can skip the buffer-switching logic if we're on the same page. */
> + if (hscan->xs_blk != ItemPointerGetBlockNumber(tid))
> {
> - /* Switch to correct buffer if we don't have it already */
> - Buffer prev_buf = hscan->xs_cbuf;
> + Assert(!*call_again);
>
> -		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
> -											  hscan->xs_base.rel,
> -											  ItemPointerGetBlockNumber(tid));
> + /* Remember this buffer's block number for next time */
> + hscan->xs_blk = ItemPointerGetBlockNumber(tid);
> +
> + if (BufferIsValid(hscan->xs_cbuf))
> + ReleaseBuffer(hscan->xs_cbuf);
> +
> + hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
>
> /*
> - * Prune page, but only if we weren't already on this page
> + * Prune page when it is pinned for the first time
> */
> - if (prev_buf != hscan->xs_cbuf)
> - heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
> + heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
> }
>
> + Assert(BufferIsValid(hscan->xs_cbuf));
> + Assert(BufferGetBlockNumber(hscan->xs_cbuf) == hscan->xs_blk);
> + Assert(hscan->xs_blk == ItemPointerGetBlockNumber(tid));
> +
> /* Obtain share-lock on the buffer so we can examine visibility */
> LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
> got_heap_tuple = heap_hot_search_buffer(tid,
This seems independent too. Let's just introduce xs_blk and do this change
separately? Then we can get it out of the way?
> + * heapam_batch_resolve_visibility
> + * Obtain visibility information for a TID from caller's batch.
> + *
> + * Called during index-only scans.  We always check the visibility of caller's
> + * item (an offset into caller's batch->items[] array).  We might also set
> + * visibility info for other items from caller's batch more proactively when
> + * that makes sense.
> + *
> + * We keep two competing considerations in balance when determining whether to
> + * check additional items: the need to keep the cost of visibility map access
> + * under control when most items will never be returned by the scan anyway
> + * (important for inner index scans of anti-joins and semi-joins), and the
> + * need to not hold onto index leaf pages for too long.
> + *
> + * Note on Memory Ordering Effects
> + * -------------------------------
> + *
> + * visibilitymap_get_status does not lock the visibility map buffer, and
> + * therefore the result we read here could be slightly stale. However, it
> + * can't be stale enough to matter.
> + *
> + * We need to detect clearing a VM bit due to an insert right away, because
> + * the tuple is present in the index page but not visible.  The reading of the
> + * TID by this scan (using a shared lock on the index buffer) is serialized
> + * with the insert of the TID into the index (using an exclusive lock on the
> + * index buffer). Because the VM bit is cleared before updating the index,
> + * and locking/unlocking of the index page acts as a full memory barrier, we
> + * are sure to see the cleared bit if we see a recently-inserted TID.
I don't think the index page locking on the inserter side is relevant - that
side of the memory barrier pairing is provided solely by visibilitymap_set(),
which uses an exclusive lock and thus also provides a barrier (so does the
heap buffer locking).
The relevant index locking would be the lock on the index page during *lookup*
leading to the call to heapam_batch_resolve_visibility(). But I guess there
could be an indexam that doesn't use locking? Seems unlikely, but ...
> + * Deletes do not update the index page (only VACUUM will clear out the TID),
Well, not just, right? We can only prune when inserting into the index?
> + * so the clearing of the VM bit by a delete is not serialized with this test
> + * below, and we may see a value that is significantly stale. However, we
> + * don't care about the delete right away, because the tuple is still visible
> + * until the deleting transaction commits or the statement ends (if it's our
> + * transaction). In either case, the lock on the VM buffer will have been
> + * released (acting as a write barrier) after clearing the bit.  And for us to
> + * have a snapshot that includes the deleting transaction (making the tuple
> + * invisible), we must have acquired ProcArrayLock after that time, acting as
> + * a read barrier.
> + *
> + * It's worth going through this complexity to avoid needing to lock the VM
> + * buffer, which could cause significant contention.
> + */
If we are concerned about read side ordering, it seems a memory barrier on the
read side would suffice (and would not cause contention).
There's a much more recent memory barrier already than the ProcArrayLock
during snapshot acquisition - the pinning of the visibilitymap is a barrier.
Which is good, because we should really remove the ProcArrayLock acquisition
when we are able to reuse snapshots... And it'd be nasty if that introduced a
hard to hit memory ordering issue.
> +static void
> +heapam_batch_resolve_visibility(IndexScanDesc scan, IndexScanBatch batch,
> +								BatchRingItemPos *pos)
> +{
FWIW, for me this is not inlined into the caller, which means that there's a
function call even for the case where the visibility information would already
be cached.
> + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> + int posItem = pos->item;
> + int noSetItem,
> + step;
> + bool allbatchitemvisible;
> +
> + /* Do nothing if we already resolved visibility for the item. */
> + if (batch->visInfo[posItem] & BATCH_VIS_CHECKED)
> + return;
I'd at least mark this branch likely, so that the compiler can see that it'd
make sense to move this check into the callsites. At least on gcc that seems
to suffice for the hot path to not need the function call.
Or you have a static inline wrapper that does the check and then calls the
bulk of heapam_batch_resolve_visibility() iff the cache isn't sufficient.
> + /* We better still have a pin on batch's index page */
> + Assert(BufferIsValid(batch->buf));
> +
> + /* Determine the range of items to set visibility for */
> + if (ScanDirectionIsForward(batch->dir))
> + {
> +		noSetItem = Min(batch->lastItem + 1, posItem + hscan->xs_vm_items);
> +		allbatchitemvisible = noSetItem > batch->lastItem &&
> +			(posItem == batch->firstItem ||
> +			 (batch->visInfo[batch->firstItem] & BATCH_VIS_CHECKED));
> + step = 1;
> + }
> + else
> + {
> +		noSetItem = Max(batch->firstItem - 1, posItem - hscan->xs_vm_items);
> + allbatchitemvisible = noSetItem < batch->firstItem &&
> + (posItem == batch->lastItem ||
> + (batch->visInfo[batch->lastItem] & BATCH_VIS_CHECKED));
> + step = -1;
> + }
> +
> +	/*
> +	 * Set visibility info for a range of items, in scan order.
> +	 *
> +	 * noSetItem is the first item (in the given scan direction) that won't be
> +	 * set during this call.  noSetItem often points to just past the end of
> +	 * (or just before the start of) the batch's 'items' array.
> +	 *
> +	 * We iterate this way to avoid the need for 2 direction-specific loops,
> +	 * since this is a hot code path that's sensitive to code size increases.
> +	 */
> + for (int setItem = posItem; setItem != noSetItem; setItem += step)
> + {
> + ItemPointer tid = &batch->items[setItem].heapTid;
> + uint8 flags = BATCH_VIS_CHECKED;
> +
> +		if (VM_ALL_VISIBLE(scan->heapRelation,
> +						   ItemPointerGetBlockNumber(tid),
> +						   &hscan->vmbuf))
> +			flags |= BATCH_VIS_ALL_VISIBLE;
> +
> + batch->visInfo[setItem] = flags;
> + }
Hm. It'll be pretty common for the block numbers of two consecutive items to
be the same. The comments here suggest that the VM lookup is a relevant
performance factor, so why not check if the last VM_ALL_VISIBLE() was for the
same block?
While visibilitymap_get_status() isn't particularly expensive, it still is at
least two external function calls (visibilitymap_get_status() and
BufferGetBlockNumber()) and some nontrivial math.
Heh, indeed: A quick hacky implementation makes
SELECT count(*) FROM pgbench_accounts
about 16% faster.
> +	/*
> +	 * It's safe to drop the batch's buffer pin as soon as we've resolved the
> +	 * visibility status of all of its items
> +	 */
> + if (allbatchitemvisible && scan->MVCCScan)
> + {
> + Assert(batch->visInfo[batch->firstItem] & BATCH_VIS_CHECKED);
> + Assert(batch->visInfo[batch->lastItem] & BATCH_VIS_CHECKED);
> +
> + ReleaseBuffer(batch->buf);
> + batch->buf = InvalidBuffer;
> + }
It doesn't seem like it can be right that heapam releases an index buffer
directly here. What, e.g., if the indexam requires multiple buffers to be
pinned for correctness?
> +/* ----------------
> + * heapam_index_getnext_slot - get the next tuple from a scan
> + *
> + * The result is true if a tuple satisfying the scan keys and the snapshot was
> + * found, false otherwise.  The tuple is stored in the specified slot.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will
> be
> + * dropped by a future call here (or by a later call to index_endscan).
> + *
> + * Note: caller must check scan->xs_recheck, and perform rechecking of the
> + * scan keys if required. We do not do that here because we don't have
> + * enough information to do it efficiently in the general case.
> + * ----------------
> + */
> +static pg_attribute_hot bool
> +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> + TupleTableSlot *slot)
> +{
> + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> + ItemPointer tid = NULL;
> +
> + for (;;)
> + {
> + if (!scan->xs_heap_continue)
> + {
> + /*
> + * Scans that use an amgetbatch index AM are managed by heapam's
> + * index scan manager. This gives heapam the ability to read heap
> + * tuples in a flexible order that is attuned to both costs and
> + * benefits on the heapam and table AM side.
> + *
> + * Scans that use an amgettuple index AM simply call through to
> + * index_getnext_tid to get the next TID returned by the index AM.
> + * The progress of the scan will be under the control of the index
> + * AM (we just pass it through a direction to get the next tuple
> + * in), so we cannot reorder any work.
> + */
> + if (scan->usebatchring)
> + tid = heapam_batch_getnext_tid(scan, hscan, direction);
> + else
> + {
It might be worth putting this function into a static inline with a
usebatchring argument, and calling it with a constant true/false from the
top-level of heapam_index_getnext_slot(). Right now the code is a fair bit
less tight due to both branches being supported inside the loop.
Or even just storing scan->usebatchring in a local var (and passing it to
index_fetch_heap()), so the compiler at least can realize that it won't change
between iterations.
> + tid = index_getnext_tid(scan, direction);
> +
> + if (tid != NULL && scan->xs_want_itup)
> + scan->xs_visible = VM_ALL_VISIBLE(scan->heapRelation,
> + ItemPointerGetBlockNumber(tid),
> + &hscan->vmbuf);
I'm a bit confused about the need for xs_visible (and also xs_heaptid, but
that's old). Afaict it's just used to store very transient information that
could just be passed around upwards?
Storing and then shortly after reading data can lead to increased latency due
to store-forwarding. And reading from memory is more expensive than just doing
it via the register. And the compiler can't optimize the memory writes out in
this kind of code, because it won't be able to prove it's never read later.
> + else
> + {
> + /*
> + * We didn't access the heap, so we'll need to take a
> + * predicate lock explicitly, as if we had. For now we do
> + * that at page level.
> + */
> + PredicateLockPage(hscan->xs_base.rel,
> + ItemPointerGetBlockNumber(tid),
> + scan->xs_snapshot);
> + }
FWIW, this does show up in profiles here (partially due to overhead of the
call, partially due to it being annoyingly expensive to get the block from a
tid). Seems we could amortise this from once-per-tuple to once-per-block for
the pretty common case of there being multiple tids for the same block in a
row?
> + /*
> + * Return matching index tuple now set in scan->xs_itup (or
> + * return matching heap tuple now set in scan->xs_hitup).
> + *
> + * Note: we won't usually have fetched a heap tuple into
> + * caller's table slot. This is per the
> + * table_index_getnext_slot contract for scan->xs_want_itup
> + * callers.
> + */
> + return true;
That contract is ... not very clear to me. table_index_getnext_slot() says:
* Fetch the next tuple from an index scan into `slot`, scanning in the
* specified direction. Returns true if a tuple was found, false otherwise.
*
* The index scan should have been started via table_index_fetch_begin().
* Callers must check scan->xs_recheck and recheck scan keys if required.
*
* Index-only scan callers (that pass xs_want_itup=true to index_beginscan)
* can consume index tuple results by examining IndexScanDescData fields such
* as xs_itup and xs_hitup. The table AM won't usually fetch a heap tuple
* into the provided slot in the case of xs_want_itup=true callers.
I don't know what "won't usually" here really means and what the caller is
supposed to do with that.
It's decidedly not free to store a tuple in a slot, so it seems a bit silly
that we do so unnecessarily if we have to verify tuple visibility.
(I didn't look too closely at further parts of the commit, as it seemed the
shape of the rest might change in response to earlier feedback).
> From 34ad62ef8f43550ccc2c0b2e2c41f30205d37716 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Thu, 12 Feb 2026 14:06:47 -0500
> Subject: [PATCH v11 04/12] Don't allocate _bt_search stack
IIUC we've talked about this one for a while. Seems we could pull it ahead of
the larger 03? Not pointlessly allocating memory that's only needed for
modifications during read-only searches seems like a hard-to-argue-with
improvement.
> Subject: [PATCH v11 05/12] Limit get_actual_variable_range to scan one index
> leaf page.
I skipped looking at this, as it seems like a separate axis. I have marked it
as something to look at in the other thread though.
> From 3fbfb87b6f3ca3c4246c1d3145b452b25f398421 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[email protected]>
> Date: Tue, 13 Jan 2026 20:44:14 +0100
> Subject: [PATCH v11 06/12] Introduce read_stream_{pause,resume,yield}().
> Note: I (pgeoghegan) have extended read_stream_yield to support yielding
> that avoids affecting the io_combine_limit mechanism.
I doubt that goes far enough into fixing the problem with yielding
early. Submitting a pending read before yielding makes sense, but it doesn't
really help with issuing IO early enough to actually get benefits from AIO.
> From d1c7743540df015ba6d1f189b5b87086437d8a04 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sat, 15 Nov 2025 14:03:58 -0500
> Subject: [PATCH v11 07/12] Add heapam index scan I/O prefetching.
>
> This commit implements I/O prefetching for index scans (and index-only
> scans that require heap fetches). This was made possible by the recent
> addition of batching interfaces to both the table AM and index AM APIs.
>
> The amgetbatch index AM interface provides batches of matching TIDs
> (rather than one tuple at a time), each of which must be taken from
> index tuples that appear together on a single index page. This allows
> multiple batches to be held open simultaneously. Giving the table AM an
> explicit understanding of index AM concepts/index page boundaries allows
> it to consider all of the relevant costs and benefits.
>
> Prefetching is implemented using a prefetching position under the
> control of the table AM and core code. This is closely related to the
> scan position added by commit FIXME, which introduced the amgetbatch
> interface. A read stream callback advances the read stream as needed to
> provide sufficiently many heap block numbers to maintain the read
> stream's target prefetch distance.
>
> Testing has shown that index prefetching can make index scans much
> faster. Large range scans that return many tuples can be as much as 35x
> faster.
>
> An important goal of the amgetbatch design is to enable the table AM's
> read stream callback to advance its prefetch position using TIDs that
> appear on a leaf page that's ahead of the current scan position's leaf
> page. This is crucial with scans of indexes where each leaf page
> happens to have relatively few distinct heap blocks among its matching
> TIDs (as well as with scans with leaf pages that have relatively few
> total matching items). Index scans can have as many as 64 open batches,
> which testing has shown to be about the maximum number that can ever be
> useful. Batches are maintained in scan order using a simple ring buffer
> data structure.
>
> In rare cases where the scan exceeds this quasi-arbitrary limit of 64,
> the read stream is temporarily paused. Prefetching (via the read
> stream) is resumed only after the scan position advances beyond its
> current open batch and then frees and removes the batch from the scan's
> batch ring buffer. Testing has shown that it isn't very common for
> scans to hold open more than about 10 batches to get the desired I/O
> prefetch distance.
> +/*
> + * Handle a change in index scan direction (at the tuple granularity).
> + *
> + * Resets the read stream, since we can't rely on scanPos continuing to agree
> + * with the blocks that read stream already consumed using prefetchPos.
> + */
> +static pg_noinline void
> +heapam_dirchange_readstream_reset(IndexFetchHeapData *hscan,
> + BatchRingBuffer *batchringbuf,
> + ScanDirection direction)
> +{
> + /* Reset read stream state */
> + batchringbuf->prefetchPos.valid = false;
> + hscan->xs_yield_check = false;
> + hscan->xs_paused = false;
> + hscan->xs_read_stream_dir = NoMovementScanDirection; /* see note below */
> +
> + /* Reset read stream itself */
> + if (hscan->xs_read_stream)
> + read_stream_reset(hscan->xs_read_stream);
> +
> + /*
> + * Finally, remember new scan direction.
> + *
> + * Note: we needed to set xs_read_stream_dir to NoMovementScanDirection
> + * momentarily to avoid spuriously prefetching more blocks from within the
> + * read stream callback. Once we return, the read stream can be used to
> + * fetch blocks in the opposite scan direction.
I don't follow, if we are resetting, nothing should look at the
xs_read_stream_dir. Ah, I see that you elsewhere have a comment saying this
is just needed due to a bug.
> + /*
> + * Consider if we need to yield, if we haven't done so already for the
> + * current prefetchBatch. We yield at the natural boundary between
> + * exhausting one batch's items and loading the next, giving the scan a
> + * chance to return scanPos items and potentially end entirely (e.g., with
> + * LIMIT) before we eagerly commit to loading another batch.
> + *
> + * When we yield, the scan will return at least one more scanPos item,
> + * often several more (particularly during scans of indexes with high
> + * physical/logical correlation, where batches contain groups of adjacent
> + * items whose TIDs all point to the same heap page).
> + *
> + * Prefetching can continue once the scan requests the buffer for the next
> + * enqueued heap block by calling read_stream_next_buffer. We weigh the
> + * need to keep the scan responsive (to avoid senselessly doing a large
> + * amount of work when we should just return a scanPos tuple immediately)
> + * against the need for the read stream to maintain its prefetch distance
> + * (if we pause too much it'll hurt the stream's ability to maintain a
> + * sufficient prefetch distance when I/O bound).
> + *
> + * Keeping the scan responsive is important during index-only scans that
> + * require only a few heap fetches. It also matters when allowing the
> + * scan to return just a few more items has the potential to allow the
> + * scan to end entirely (e.g., with an ORDER BY ... LIMIT N, or within a
> + * scan that feeds into a merge join).
> + */
> + else if (!hscan->xs_yield_check)
> + {
Minor nit: I find it pretty odd that we check for yielding if xs_yield_check
is false.
> + hscan->xs_yield_check = true;
> + if (!hscan->xs_rampup_done)
> + {
> + /*
> + * Never yield during the initial ramp-up phase of the scan.
> + * Yielding too early prevents the read stream from building up an
> + * adequate prefetch distance.
> + */
> + if (batchringbuf->headBatch >= 4)
> + hscan->xs_rampup_done = true;
Where is that 4 coming from? It doesn't seem like having processed more than 4
batches is a particularly good proxy for anything?
> + }
> +
> + if (hscan->xs_rampup_done)
> + return read_stream_yield(stream);
I don't really understand what the logic behind yielding is.
From what I can tell we'll basically yield once per batch after reaching
headBatch >= 4 (because heapam_batch_getnext() sets xs_yield_check = false).
That's maybe fine if one index batch references a lot of heap pages, but if it
instead returns tids for consecutive table blocks, it's highly likely that the
entire batch can be executed as a single IO for heap, due to IO combining
(unless part of the data is in the buffer pool). Which, in turn, means that
it's unlikely to have more than one IO in flight in that case - the yielding
will trigger one IO and IO will only be resumed the next time
read_stream_next_buffer() is called, which will wait for the IO started before
yielding.
Seems at the very least this would need to be tied to the difference between
the scan and prefetch positions, and to increase the distance between the two
that needs to be reached before yielding again.
What's the workload showing the need for yielding most extremely? Is that the
merge join case we talked about before? I'd like to experiment with a few
other approaches.
Ran out of time & energy for the moment, more later.
Greetings,
Andres Freund