On Sat, Feb 28, 2026 at 1:01 PM Andres Freund <[email protected]> wrote:
> > @@ -977,6 +978,49 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
> >
> >               ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
> >       }
> > +     else if (IsA(child_node, IndexScanState))
> > +     {
> > +             /*
> > +              * If it is an IndexScan, save the tuples_needed in the state so it
> > +              * can be propagated to the IndexScanDesc when the scan is started.
> > +              *
> > +              * Note: As with Sort, the index scan node is responsible for reacting
> > +              * properly to changes to this parameter.
> > +              */
> > +             IndexScanState *isstate = (IndexScanState *) child_node;
> > +
> > +             isstate->iss_TuplesNeeded = tuples_needed;
>
> I think this would need a comment explaining that, in contrast to e.g. the
> limit in a sort node, tuples_needed here is just an approximation, not precise
> (due to non-index quals).

Will fix.

> I suspect this isn't quite correct as-is. Consider a semi join nestloop with a
> post-join qual. In that case we might pull more nodes from the outer side than
> the tuple limit. Which would not necessarily work if e.g. the outer side is
> implemented as a top-n sort.

Will fix.

> Another issue is that the, I think, most potentially wasteful case for index
> prefetching is actually a nestloop anti join doing a lot of unnecessary
> prefetching on the *inner* side, because it can happen many times in a row.
> But right now we don't use ExecSetTupleBound() for such cases...

Yeah, that design choice was fairly arbitrary. Will look into calling
ExecSetTupleBound for the inner side, too.

> Wonder if we should expand ExecSetTupleBound() with an additional argument
> indicating whether the passed down limit is a hard limit (so it can be used by
> e.g. bounded sort) or a soft limit (in which case it only would be used to
> limit prefetching).
>
>
> If we had something like that, and used it for hinting anti-join nestloops,
> merge joins, ... we could pass that knowledge down to index scans.

Sounds like a good idea.

> I suspect that most of what yielding addresses could then be addressed by a
> read stream flag forcing read stream's distance to increase much more slowly,
> and perhaps also bounded by a lower distance.  Something like a
> READ_STREAM_SLOW_START flag at creation and/or read_stream_limit_distance()?

I'll experiment with this before the next revision. The main goal will
be to completely remove the current yielding mechanism.

> This made me look at some of the surrounding code again. And I suspect we
> might have a bug with unlogged tables (independent of this patch, just related
> due to the fake lsn infra).

> The consequences seem luckily limited, I don't think anything today uses
> bulkread strategies (which are the only ones where StrategyRejectBuffer()
> returns false if a flush is needed) with gist or such.  But it'd be completely
> reasonable for XLogNeedsFlush() to have assertions about the LSN being in
> range, we just don't today.

Noted.

> These changes afaict are unrelated to the subject of the commit?

They are needed to determine which buffers require an LSN change in
the unlogged case, ensuring consistency with the logged case. I guess
using the WAL record struct's fields for the unlogged case didn't feel
right.

> Seems like this should be pulled to before the if (RelationNeedsWAL(rel))
> so we don't have duplicate, potentially diverging, logic?

Will fix.

> > From a99a3f0576c2c055c269e5d24a5d8075db0b66e7 Mon Sep 17 00:00:00 2001
> > From: Peter Geoghegan <[email protected]>
> > Date: Tue, 25 Nov 2025 18:03:15 -0500
> > Subject: [PATCH v11 12/12] Make hash index AM use amgetbatch interface.
>
> > Replace hashgettuple with hashgetbatch, a function that implements the
> > new amgetbatch interface.  Plain index scans of hash indexes now return
> > matching items in batches consisting of all of the matches from a given
> > bucket or overflow page.  This gives the core executor the ability to
> > perform optimizations like index prefetching during hash index scans.
> >
> > Note that hash index scans will now drop index page buffer pins eagerly
> > (actually, the table AM will do so on behalf of the hash index AM).
>
> I wonder if that part should be pulled out of this commit? I think that very
> well could be committed earlier.

Do you mean I should create a new commit, equivalent in scope to
nbtree's "Use fake LSNs to improve nbtree dropPin behavior", that
applies to hash indexes?

That is somewhat more complicated with the hash index AM, because it
currently lacks dropPin behavior. It requires adding something like
nbtree's _bt_drop_lock_and_maybe_pin, finding a place to set the LSN
only when it makes sense to, etc.

-- 
Peter Geoghegan

