Hi,
On 2026-02-28 14:16:22 -0500, Peter Geoghegan wrote:
> > This is a huge change. Is there a chance we can break it up into more
> > manageable chunks?
>
> I'm open to any ideas about that (like the one you go on to suggest
> about introducing IndexFetchHeapData.xs_blk in its own commit).
The indexscan stats changes, if we want them, are also pretty verbose and
pretty independent.
It'd be great if the move of indexonly visibility check handling from
nodeIndexonlyscan.c into the table AM could be split out, but it might be too
churn-y. But getting it out of the way first would be nice.
> > A first note: Seems like this needs to update doc/src/sgml/indexam.sgml?
>
> I arbitrarily decided to put all of the sgml doc updates into the
> later prefetching commit. Purely because it was hard to avoid talking
> about prefetching in the docs. I can put minimal sgml doc updates in
> the amgetbatch commit.
Ah, ok.
> > > +/*
> > > + * Location of a BatchMatchingItem that appears in a IndexScanBatch returned
> > > + * by (and subsequently passed to) an amgetbatch routine
> > > + */
> > > +typedef struct BatchRingItemPos
> > > +{
> > > +	/* Position references a valid BatchRingBuffer.batches[] entry? */
> > > +	bool		valid;
> > > +
> > > +	/* BatchRingBuffer.batches[]-wise index to relevant IndexScanBatch */
> > > +	uint8		batch;
> > > +
> > > +	/* IndexScanBatch.items[]-wise index to relevant BatchMatchingItem */
> > > +	int			item;
> > > +} BatchRingItemPos;
> >
> > Does item actually need to be a 32bit int? This way there's a hole and
> > presumably IndexScanBatch.items[] has a limited size?
>
> We don't allocate more than one of these BatchRingItemPos structs
> anywhere, and always do so on the stack.
Hm? There are two in BatchRingBuffer and I don't see any on the stack?
It probably still doesn't matter though, compared to the size of
BatchRingBuffer.batches (once it's beyond the toy size).
> I think that we rely on them
> being signed within _bt_readpage, when it generates its return value,
> here:
>
> """
> return (newbatch->firstItem <= newbatch->lastItem);
> """
> During a forwards scan _bt_readpage that returns no items, we'll
> return false when "newbatch->lastItem = itemIndex - 1;" makes
> newbatch->lastItem have the value -1 (note that newbatch->firstItem is
> always 0 during a _bt_readpage in the forwards scan direction).
Isn't that about a different field? Ah, I see, you later say they have to be
the same type as BatchRingItemPos.item.
This actually reminds me that it's not entirely clear to me what firstItem,
lastItem mean:
	/*
	 * Matching items state for this batch. Output by index AM for table AM.
	 *
	 * The items array is always ordered in index order (ie, increasing
	 * indexoffset). When scanning backwards it is convenient for index AMs
	 * to fill the array back-to-front, so we start at the last slot and fill
	 * downwards. Hence they need a first-valid-entry and a last-valid-entry
	 * counter.
	 */
	int			firstItem;		/* first valid index in items[] */
	int			lastItem;		/* last valid index in items[] */
I think it should at very least state that -1 is used to indicate there are no
items.
I take it that if firstItem == lastItem, there's exactly one item?
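If I have that right, the convention boils down to something like this (a standalone sketch of my reading, not patch code; batch_nitems() is a name I made up):

```c
#include <assert.h>

/*
 * Illustrative only: number of valid entries in items[], given the
 * firstItem/lastItem convention as I understand it.  An empty batch is
 * signalled by lastItem < firstItem (e.g. firstItem = 0, lastItem = -1
 * after a forwards-scan _bt_readpage that matched nothing), which is also
 * why the fields can't trivially be made unsigned.
 */
static int
batch_nitems(int firstItem, int lastItem)
{
	if (firstItem > lastItem)
		return 0;				/* no matching items in this batch */
	return lastItem - firstItem + 1;
}
```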
> > > + BlockNumber currPage; /* Index page with matching items */
> > > + BlockNumber prevPage; /* currPage's left link */
> > > + BlockNumber nextPage; /* currPage's right link */
> >
> > Hm. Are these, and some of the later fields, generic enough for all index
> > types? There IIRC are index types out there that don't use buffers or blocks
> > in that sense.
> >
> > I wonder if individual index AMs need a way to influence what is stored per
> > batch.
>
> How strongly do you feel about it?
I think it's a problem: it's too prescriptive about what exactly an indexam
might need and how it stores data.
> Many months ago, this struct provided a way for index AMs to provide
> their own state in each batch. But I ultimately concluded that the
> overhead of that approach was a problem.
With the approach that I hand waved at, namely to use one allocation for both
the AM private state and the generic state, that should not be a real issue?
There should be maybe 2-3 cheap instructions to translate? I've expanded on
that further down.
> > > +	Buffer		buf;		/* currPage buf (invalid means unpinned) */
> > > +	XLogRecPtr	lsn;		/* currPage's LSN */
> > > +
> > > +	/* scan direction when the index page was read */
> > > +	ScanDirection dir;
> > > +
> > > +	/*
> > > +	 * knownEndLeft and knownEndRight are used by table AMs to track whether
> > > +	 * there may be matching index entries to the left and right of currPage,
> > > +	 * respectively. This helps them to avoid repeatedly calling amgetbatch.
> > > +	 */
> > > +	bool		knownEndLeft;
> > > +	bool		knownEndRight;
> >
> > If I understand correctly, these mean that we'd reach the end of the scan
> > in one direction? If so, why does it use left/right instead of
> > forward/backward?
>
> The names are based on the existing, similar moreLeft and moreRight
> fields used by both nbtree and hash. I'm not attached to any of these
> names. Feel free to suggest alternatives.
knownEnd{Forward,Backward} or isEnd{Forward,Backward} seems like it'd do the
trick?
> > > +	/*
> > > +	 * Matching items state for this batch. Output by index AM for table AM.
> > > +	 *
> > > +	 * The items array is always ordered in index order (ie, increasing
> > > +	 * indexoffset). When scanning backwards it is convenient for index AMs
> > > +	 * to fill the array back-to-front, so we start at the last slot and fill
> > > +	 * downwards. Hence they need a first-valid-entry and a last-valid-entry
> > > +	 * counter.
> > > +	 */
> > > +	int			firstItem;		/* first valid index in items[] */
> > > +	int			lastItem;		/* last valid index in items[] */
> >
> > I'd make these unsigned.
>
> They must be the same type as BatchRingItemPos.item offsets. So I
> could probably make them int16, but it wouldn't be easy to make them
> unsigned due to the aforementioned _bt_readpage issue.
>
> > Not sure what "indexoffset" really means in a generic abstraction.
>
> It means an offset into the currTuples buffer at which the table AM
> can find an IndexTuple to return, during an index-only scan. Can you
> suggest an alternative terminology and/or presentation?
	/*
	 * If we are doing an index-only scan, these are the tuple storage
	 * workspaces for the matching tuples (tuples referenced by items[]). Each
	 * is of size BLCKSZ, so it can hold as much as a full page's worth of
	 * tuples.
	 */
	char	   *currTuples;		/* tuple storage for items[] */
	BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items */
I previously read the "these are" to reference currTuples and items, but I
think the plural is actually just an anachronism, from when currTuples was
followed by markTuples?
It says that it's the tuple storage for items[], but I think that may just be
trying to say that it's the tuple storage corresponding to the items in
->items[].
> > > +	/*
> > > +	 * If we are doing an index-only scan, this array stores the per-item
> > > +	 * visibility flags BATCH_VIS_CHECKED and BATCH_VIS_ALL_VISIBLE
> > > +	 */
> > > +	uint8	   *visInfo;	/* per-item visibility flags, or NULL */
> >
> > Why just for index only? Seems like it'd be rather useful to batch
> > visibility
> > checks for plain index scans too? Obviously that doesn't have to be
> > implemented immediately, but I'd probably not preclude it in generic
> > infrastructure.
>
> Whether or not we allocate visInfo space from the generic indexbatch.c
> routines is determined by IndexScanDesc.xs_want_itup, though.
Yea, that's my point, why should it be controlled that way :). If you e.g. had
an undo based AM, the gain from batch resolving visibility for non-index-only
scans would be an order of magnitude bigger than it is in heap, but with this
coupling it couldn't easily do so.
> > This seems pretty heap specific. I guess you didn't put it in
> > IndexFetchHeapData because it'd be somewhat complicated to have per-batch
> > state in there?
>
> Yes. And because we really want to keep the visInfo space right next
> to the corresponding BatchMatchingItem/items[] state. Allocating and
> releasing all batch state together is important.
> > Perhaps my earlier suggestion about having "private" per-batch state for the
> > indexam is required for table AMs too. I guess we could store an offset to
> > the "tableam private" and "indexam private" data in each batch.
>
> I think that would be painful. Especially if the goal is to allocate
> visInfo in memory that is exclusively managed by heapam.
I agree it shouldn't be exclusively managed by heapam and that it needs to be
stored together with the batch. But it could be a size determined at the time
of index_beginscan() as part of table_index_fetch_begin(heapRelation), e.g. a
field in the returned IndexFetchTableData.
I'm basically thinking that each batch would be stored like this:
[table am specific data]
[index am specific data]
struct IndexScanBatchData
[variable amount of items]
With the offset from the *IndexScanBatchData to the index/table data stored
inside IndexScanDesc.
Accessing the heap specific portion of an IndexScanBatch could then be
something like
((char*) batch - iscan->batch_table_offset)
(wrapped in a helper, of course)
The index AM conversion could be made even cheaper with the above, because for
the index AM the offset would typically be known at compile time
(i.e. something like MAXALIGN(sizeof(BTIndexBatchData))).
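To make the layout concrete, a rough standalone sketch (all names invented; the real thing would obviously use palloc and live in indexbatch.c / the AMs, and items[] would trail the generic struct):

```c
#include <stdint.h>
#include <stdlib.h>

#define MAXALIGN(LEN) (((uintptr_t) (LEN) + 7) & ~((uintptr_t) 7))

/* stand-ins for the real structs; only the sizes matter here */
typedef struct TableAmBatchData { uint8_t vis_flags[64]; } TableAmBatchData;
typedef struct IndexAmBatchData { int am_private; } IndexAmBatchData;
typedef struct IndexScanBatchData { int firstItem; int lastItem; } IndexScanBatchData;

typedef struct IndexScanDescData
{
	size_t		batch_table_offset; /* distance back to table AM batch data */
	size_t		batch_index_offset; /* distance back to index AM batch data */
} IndexScanDescData;

/*
 * One allocation holds [table am data][index am data][generic batch data].
 * The offsets are the same for every batch of a scan, so they could be
 * computed once, at index_beginscan() time.
 */
static IndexScanBatchData *
batch_alloc(IndexScanDescData *scan, size_t table_size, size_t index_size)
{
	size_t		table_sz = MAXALIGN(table_size);
	size_t		index_sz = MAXALIGN(index_size);
	char	   *mem = malloc(table_sz + index_sz + sizeof(IndexScanBatchData));

	scan->batch_table_offset = table_sz + index_sz;
	scan->batch_index_offset = index_sz;
	return (IndexScanBatchData *) (mem + table_sz + index_sz);
}

/* the "2-3 cheap instructions": just a subtraction from the batch pointer */
static TableAmBatchData *
batch_table_data(IndexScanDescData *scan, IndexScanBatchData *batch)
{
	return (TableAmBatchData *) ((char *) batch - scan->batch_table_offset);
}

static IndexAmBatchData *
batch_index_data(IndexScanDescData *scan, IndexScanBatchData *batch)
{
	return (IndexAmBatchData *) ((char *) batch - scan->batch_index_offset);
}
```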
> > > +	/*
> > > +	 * If we are doing an index-only scan, this is the tuple storage workspace
> > > +	 * for the matching tuples (tuples referenced by items[]). It is of size
> > > +	 * BLCKSZ, so it can hold as much as a full page's worth of tuples.
> > > +	 * currTuples points into the trailing portion of this allocation, just
> > > +	 * past items[]. It is NULL for plain index scans.
> > > +	 */
> >
> > This was a bit hard to understand before, but in generic code it seems like
> > it
> > is too specific to the way it was used in nbtree.
>
> How so?
BLCKSZ might not mean anything to a non block oriented index am. Or BLCKSZ
might not be sufficient for an AM that compresses index entries.
> > Hm, it's somewhat confusing that we have this now, but still have
> > ->index_fetch_tuple() etc.
>
> I don't see a way around that. Not as long as there's a need to
> support table_index_fetch_tuple_check, which is fundamentally about
> passing a TID owned by the caller.
Hm. It looks like the only remaining users (_bt_check_unique(),
unique_key_recheck()) don't care about actually getting the tuple, they just
want to check if it exists.
For those the API is actually unnecessarily expensive: we don't need to return
the tuple, and therefore don't need to create a slot. So once we get the main
stuff in, I think we should evolve the API to be a bit more catered
specifically to this usecase. I think it'll be a lot less confusing that way.
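Hand-waving at a shape for that (every name here is invented, and the dummy body exists only to make the shape concrete):

```c
#include <stdbool.h>
#include <stddef.h>

/* opaque stand-ins for the real PostgreSQL types */
typedef struct IndexFetchTableData IndexFetchTableData;
typedef struct ItemPointerData *ItemPointer;
typedef struct SnapshotData *Snapshot;

/*
 * A narrower tableam callback for _bt_check_unique()/unique_key_recheck()
 * style callers: report whether any visible tuple version exists at tid,
 * without materializing the tuple into a slot.
 */
typedef bool (*index_fetch_tuple_exists_function) (IndexFetchTableData *scan,
												   ItemPointer tid,
												   Snapshot snapshot,
												   bool *all_dead);

/* dummy implementation, just to show how a caller would use the shape */
static bool
dummy_tuple_exists(IndexFetchTableData *scan, ItemPointer tid,
				   Snapshot snapshot, bool *all_dead)
{
	(void) scan;
	(void) tid;
	(void) snapshot;
	*all_dead = false;
	return true;
}
```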
> > Don't really have an opinion about this, but it's not clear to me why this
> > patch changed this?
>
> Really? You actually suggested it (albeit in a completely different
> context). It just felt like a good idea to make the allocation
> dynamic, now that I'm adding more stuff to the instrumentation struct.
> I can change it back, if you prefer.
>
> > If we want to change it, could we do that in a prereq patch?
>
> Sure.
I think it's a sane change, it just didn't seem really related with the
commit.
> > This seems independent too. Let's just introduce xs_blk and do this change
> > separately? Then we can get it out of the way?
>
> Will do. The next revision will put these changes into their own patch.
Thanks.
> > > + * so the clearing of the VM bit by a delete is not serialized with this test
> > > + * below, and we may see a value that is significantly stale. However, we
> > > + * don't care about the delete right away, because the tuple is still visible
> > > + * until the deleting transaction commits or the statement ends (if it's our
> > > + * transaction). In either case, the lock on the VM buffer will have been
> > > + * released (acting as a write barrier) after clearing the bit. And for us to
> > > + * have a snapshot that includes the deleting transaction (making the tuple
> > > + * invisible), we must have acquired ProcArrayLock after that time, acting as
> > > + * a read barrier.
> > > + *
> > > + * It's worth going through this complexity to avoid needing to lock the VM
> > > + * buffer, which could cause significant contention.
> > > + */
> >
> > If we are concerned about read side ordering, it seems a memory barrier on
> > the read side would suffice (and would not cause contention).
> >
> > There's a much more recent memory barrier already than the ProcArrayLock
> > during snapshot acquisition - the pinning of the visibilitymap is a barrier.
> >
> > Which is good, because we should really remove the ProcArrayLock acquisition
> > when we are able to reuse snapshots... And it'd be nasty if that introduced
> > a hard to hit memory ordering issue.
>
> Once again, these are direct copies of the existing comments.
Sorry for missing that, I had just looked at the new code, given the size of
the commit...
> Can you suggest a replacement comment block that comprehensively addresses
> all your concerns? I think that these are all independent issues. As far as
> I know there's no reason to think that the rules in this area have changed
> (or need to change) due to our structural/layering revisions.
I'll try.
> > Hm. It'll be pretty common for the block number of two consecutive items
> > to be on the same page. The comments here suggest that the VM lookup is a
> > relevant performance factor, so why not check if the last VM_ALL_VISIBLE()
> > was for the same block?
> >
> > While visibilitymap_get_status() isn't particularly expensive, it still is
> > at least two external function calls (visibilitymap_get_status() and
> > BufferGetBlockNumber()) and some nontrivial math.
> >
> >
> > Heh, indeed: A quick hacky implementation makes
> >   SELECT count(*) FROM pgbench_accounts
> > about 16% faster.
>
> I'm aware of several ways to speed up the bulk VM lookups, but I held
> off because it didn't seem particularly crucial to this patch. For
> example, Matthias's bug fix for GiST index-only scans actually adds
> bulk VM lookups that use SIMD instructions.
To be clear, I was just suggesting to elide repeated VM lookups for index
entries that all point to the same table page. Not to actually do any batching
or such. It's just four more lines or so.
It seems somewhat related, in that the patch opens up the possibility of doing
more VM lookups than before, e.g. due to an IOS on the inner side of an
anti-join nestloop.
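Roughly the kind of thing I mean, with a stand-in for VM_ALL_VISIBLE() (in the real patch the cached block/status would of course be fields in the scan's per-scan state, not statics):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

static int	vm_lookups = 0;		/* counts "real" lookups, to show the elision */

/* stand-in for VM_ALL_VISIBLE(); pretend even-numbered blocks are all-visible */
static bool
vm_all_visible(BlockNumber blk)
{
	vm_lookups++;
	return (blk % 2) == 0;
}

/* in the real thing these would live in the scan's per-scan state */
static BlockNumber cached_blk = InvalidBlockNumber;
static bool cached_vis = false;

/*
 * Elide repeated VM lookups when consecutive index entries point to the
 * same table page -- just the four extra lines, no batching.
 */
static bool
block_all_visible_cached(BlockNumber blk)
{
	if (blk != cached_blk)
	{
		cached_vis = vm_all_visible(blk);
		cached_blk = blk;
	}
	return cached_vis;
}
```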
> Seems like independent work? I can add a patch like this if you think
> that it's important.
I think it's worth doing, given the simplicity of it and that this is all new
code. Having a bit more of an efficiency buffer against regressions is nice
too.
> > > +	/*
> > > +	 * It's safe to drop the batch's buffer pin as soon as we've resolved the
> > > +	 * visibility status of all of its items
> > > +	 */
> > > +	if (allbatchitemvisible && scan->MVCCScan)
> > > +	{
> > > +		Assert(batch->visInfo[batch->firstItem] & BATCH_VIS_CHECKED);
> > > +		Assert(batch->visInfo[batch->lastItem] & BATCH_VIS_CHECKED);
> > > +
> > > +		ReleaseBuffer(batch->buf);
> > > +		batch->buf = InvalidBuffer;
> > > +	}
> >
> > It doesn't seem like it can be right that heapam releases an index buffer
> > directly here.
>
> I think that it makes perfect sense that the AM table controls this.
> After all, holding on to buffer pins exists purely for heapam's
> benefit -- it has nothing to do with index AM considerations.
I think it's right that the table AM controls *when* the pins are released;
what doesn't seem right is that it knows *which* specific buffer is the right
one to release, and that it does so directly. What if the index AM is one
outside of postgres' buffer pool?
> > What, e.g., if the indexam requires multiple buffers to be
> > pinned for correctness?
>
> I agree that's not unusual. In fact, we often require this within hash
> index scans.
>
> That is handled in the later hash index patch by calling
> IncrBufferRefCount against a to-be-returned batch's buffer just before
> it is returned, as needed. This seems like a fairly elegant solution.
> It allows the index AM to hold its own pin reference, which can be
> released either before or after the table AM's reference, without any
> special care.
That helps with multiple pins for the same buffer, but not having to pin
multiple *different* buffers. Sure, the index AM could just hold onto the
additional buffers until the scan ends or such, but that's not really an
option.
I really think that the generic code in indexbatch.c or the heapam code has
absolutely no business itself mucking around with index AM buffers. That would
be a huge layering violation.
> > > +static pg_attribute_hot bool
> > > +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> > > + TupleTableSlot *slot)
> > > +{
>
> > It might be worth putting this function into a static inline with a
> > usebatchring argument, and calling it with a constant true/false from the
> > top-level of heapam_index_getnext_slot(). Right now the code is a fair bit
> > less tight due to both branches being supported inside the loop.
> >
> > Or even just storing scan->usebatchring in a local var (and passing it to
> > index_fetch_heap()), so the compiler at least can realize that it won't
> > change
> > between iterations.
>
> I experimented with these ideas yesterday, but for whatever reason
> didn't see much of any win. I'll revisit it, though.
>
> > > +			tid = index_getnext_tid(scan, direction);
> > > +
> > > +			if (tid != NULL && scan->xs_want_itup)
> > > +				scan->xs_visible = VM_ALL_VISIBLE(scan->heapRelation,
> > > +												  ItemPointerGetBlockNumber(tid),
> > > +												  &hscan->vmbuf);
> >
> > I'm a bit confused about the need for xs_visible (and also xs_heaptid, but
> > that's old). Afaict it's just used to store very transient information that
> > could just be passed around upwards?
>
> Yes. I agree that that's a little weird.
>
> > Storing and then shortly after reading data can lead to increased latency
> > due to store-forwarding. And reading from memory is more expensive than
> > just doing it via the register. And the compiler can't optimize the memory
> > writes out in this kinda code, because it won't be able to prove it's
> > never read later.
>
> Again, I didn't see much success when I experimented with this
> yesterday. But it could be a question of needing the right query to
> see any benefit.
I'd say that even if it were not to improve perf (pretty sure I saw some
though), not introducing fields that have more complicated lifetimes is just
good for human understanding too.
Greetings,
Andres Freund