Hi, I got pinged about issues (compiler warnings, and some test failures) in the simple patch version shared in May. So here's a rebased and cleaned-up version addressing those, and a couple of additional issues I ran into.
FWIW if you run check-world on this, you may get failures in the io_workers TAP test. That's a pre-existing issue [1]; the patch just makes it easier to hit, as it (probably) added AIO in some part of the test. Otherwise it should pass all tests (and it does for me on CI).

The main changes in the patches and remaining questions:

(1) fixed compiler warnings

These were mostly due to contrib/test AMs with not-updated ambeginscan() implementations.

(2) GiST fixes

I fixed a bug in how the prefetching handled distances, leading to "tuples returned out of order" errors. It did not copy the Datums when batching the reordered values, not realizing they may be FLOAT8, and on 32-bit systems the Datum is just a pointer. Fixed by datumCopy().

I'm not aware of any other actual bug in the GiST code, but I'm sure the memory management there is sketchy and likely leaks memory. Needs some more thought and testing. The SP-GiST code may have similar issues.

(3) ambeginscan(heap, index, ...)

I originally undid the changes to ambeginscan(), i.e. the callback was restored back to what master has. To create the ReadStream the AM needs the heap relation, but it could build the Relation itself using index->rd_index->indrelid. That worked, but I did not like it for two reasons. The AM then needs to manage the relation (close it etc.). And there was no way to know when ambeginscan() gets called for a bitmap scan, in which case the read_stream is unnecessary/useless. So it got created, but never used. Not very expensive, but messy.

So I ended up restoring the ambeginscan() change, i.e. it now gets the heap relation. I ended up passing it as the first argument, mostly for consistency with index_beginscan(), which also does (heap, index, ...). I renamed the index argument from 'rel' to 'index' in a couple of the AMs; it was confusing to have 'heap' and 'rel'.

(4) lastBlock

I added the optimization to not queue duplicate block numbers, i.e. if the index returns a sequence of TIDs from the same block, we skip queueing that and simply use the buffer we already have. This is quite a bit more efficient. This is something the read_next callback in each AM needs to do, but it's pretty simple.

(5) xs_visible

The current patch expects the AM to set xs_visible even if it's not using a ReadStream (AMs that do use one are required to determine it in the callback anyway). If the AM does not do that, index-only scans are broken.

But it occurs to me we could handle this in index_getnext_tid(). If the AM does not use a ReadStream (xs_rs == NULL), we can check the VM and store the value in xs_visible. It'd need moving the vmBuffer to the scan descriptor (it's now in IndexOnlyScanState), but that seems OK. And the AMs now add the buffer anyway.

(6) SGML

I added a couple of paragraphs to indexam.sgml, documenting the new heap argument, and also the requirements on the read_next callback (e.g. the lastBlock handling and the xs_visible setting).

(7) remaining annoyances

There's a couple of things that still annoy me - the "read_next" callbacks are very similar, and duplicate a fair amount of code doing the stuff they're required to do. There's a little bit of AM-specific code to get the next item from the ScanOpaque structs, and then code to skip duplicate block numbers and check the visibility map (if needed). I believe both of these things could be refactored into some shared place. The AMs would just call a function from indexam.c (which seems OK from a layering POV, and there's plenty of such calls). A rough sketch of what I mean is below.
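To illustrate, here's a rough, untested sketch of such a shared callback in indexam.c. The IndexScanStreamItem / IndexScanStreamState structs and the next_item callback are made-up names just for this sketch - each AM would only provide a trivial callback returning the next queued item from its ScanOpaque, and the shared code would handle the duplicate-block and VM logic the per-AM read_next callbacks currently duplicate:

    #include "postgres.h"

    #include "access/genam.h"
    #include "access/relscan.h"
    #include "access/visibilitymap.h"
    #include "storage/itemptr.h"
    #include "storage/read_stream.h"

    /* one TID queued by the AM, plus the cached VM result (for IOS) */
    typedef struct IndexScanStreamItem
    {
        ItemPointerData heapTid;
        bool        allVisibleSet;  /* VM already checked for this item? */
        bool        allVisible;     /* cached VM result, reused by amgettuple */
    } IndexScanStreamItem;

    typedef struct IndexScanStreamState
    {
        IndexScanDesc scan;
        BlockNumber lastBlock;      /* last block queued (InvalidBlockNumber
                                     * initially, and after each reset) */
        Buffer      vmBuffer;       /* buffer for VM lookups */

        /* AM-specific: return next queued item, or NULL when out of items */
        IndexScanStreamItem *(*next_item) (IndexScanDesc scan);
    } IndexScanStreamState;

    static BlockNumber
    index_scan_stream_read_next(ReadStream *stream,
                                void *callback_private_data,
                                void *per_buffer_data)
    {
        IndexScanStreamState *state = (IndexScanStreamState *) callback_private_data;
        IndexScanDesc scan = state->scan;
        IndexScanStreamItem *item;

        while ((item = state->next_item(scan)) != NULL)
        {
            BlockNumber block = ItemPointerGetBlockNumber(&item->heapTid);

            /* index-only scan: check the VM, remember result for amgettuple */
            if (scan->xs_want_itup)
            {
                if (!item->allVisibleSet)
                {
                    item->allVisibleSet = true;
                    item->allVisible = VM_ALL_VISIBLE(scan->heapRelation, block,
                                                      &state->vmBuffer);
                }

                /* don't queue all-visible blocks, we won't read them */
                if (item->allVisible)
                    continue;
            }

            /* don't queue runs of the same block */
            if (block == state->lastBlock)
                continue;

            state->lastBlock = block;
            return block;
        }

        /* end of the current leaf page, stream needs a reset to continue */
        return InvalidBlockNumber;
    }

The AM-specific read_next callbacks would then reduce to the next_item lookup, and the reset logic (streamIndex = -1, lastBlock = InvalidBlockNumber, read_stream_reset) could live in the same place too.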
I believe the same place could also act as the "scan manager" component managing the prefetching (and related stuff?), as suggested by Peter Geoghegan some time ago. I ran out of time to work on this today, but I'll look into this soon.

FWIW I'm still planning to work on the "complex" patch version and see if it can be moved forward. I've been having some very helpful chats about this with Peter Geoghegan, and I'm still open to the possibility of making it work. This simpler version is partially a hedge to have at least something in case the complex patch does not make it.

regards

[1] https://www.postgresql.org/message-id/t5aqjhkj6xdkido535pds7fk5z4finoxra4zypefjqnlieevbg%40357aaf6u525j

--
Tomas Vondra
From b0b775d1f8e29be04c65c4030dbe0a38b5436dbe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 1 May 2025 20:22:55 +0200
Subject: [PATCH v20250709 1/6] index prefetch infrastructure

Extends the AM interface to allow optional use of ReadStream for heap access. The main changes are:

* The AM may create the stream, and set it into scan->xs_rs, which indicates the heap should be accessed this way. For this the AM needs to also define the read_next callback, accessing the private "opaque" part of the scan state.

Note: This patch does not implement this for any existing AMs, beyond updating the signature. That comes in later patches.

* To allow creating a read_stream, the ambeginscan() callback now gets the heap relation as the first argument.

* The indexam.c then passes scan->xs_rs to index_fetch_begin(). If the pointer is NULL, the regular ReadBuffer() interface will be used.

* A new GUC "enable_indexscan_prefetch" (default: on) is introduced, to make experimentation easier. Not sure we want to keep this.

* The executor layer is almost untouched, except for index-only scans. The read_next callback needs to skip all-visible pages (we don't want to prefetch those), and the two sides (read_next and index_fetch_heap) need to have the same visibility result, even if the VM gets updated in between. The new scan->xs_visible flag is used to pass this to the executor.

* The AM has to determine the block visibility before returning the block from the read_next callback (and not return it if it's all visible). And it needs to remember the value and use it when returning the TID/tuple from amgettuple. These two places have to agree on the visibility.

* The xs_visible has to be set even by AMs that don't support the read_stream. Maybe this should be rethought, and indexam.c should do that when xs_rs=NULL.

* The read_next callback must not return duplicate blocks, i.e. blocks that are exactly the same as the last returned block. That would break the optimization that we don't read/pin blocks unnecessarily.
--- contrib/bloom/bloom.h | 2 +- contrib/bloom/blscan.c | 4 +- doc/src/sgml/indexam.sgml | 35 ++++++- src/backend/access/brin/brin.c | 8 +- src/backend/access/gin/ginscan.c | 4 +- src/backend/access/gist/gistscan.c | 4 +- src/backend/access/hash/hash.c | 4 +- src/backend/access/heap/heapam_handler.c | 94 ++++++++++++++++++- src/backend/access/index/genam.c | 1 + src/backend/access/index/indexam.c | 21 +++-- src/backend/access/nbtree/nbtree.c | 6 +- src/backend/access/spgist/spgscan.c | 12 +-- src/backend/access/table/tableam.c | 2 +- src/backend/commands/constraint.c | 3 +- src/backend/executor/nodeIndexonlyscan.c | 10 +- src/backend/optimizer/path/costsize.c | 1 + src/backend/storage/buffer/bufmgr.c | 40 ++++++++ src/backend/utils/misc/guc_tables.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/amapi.h | 3 +- src/include/access/brin_internal.h | 3 +- src/include/access/gin_private.h | 3 +- src/include/access/gistscan.h | 3 +- src/include/access/hash.h | 3 +- src/include/access/nbtree.h | 3 +- src/include/access/relscan.h | 4 + src/include/access/spgist.h | 3 +- src/include/access/tableam.h | 12 ++- src/include/optimizer/cost.h | 1 + src/include/storage/bufmgr.h | 2 + .../modules/dummy_index_am/dummy_index_am.c | 4 +- src/test/regress/expected/sysviews.out | 3 +- 32 files changed, 254 insertions(+), 55 deletions(-) diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h index 648167045f4..00d8be39953 100644 --- a/contrib/bloom/bloom.h +++ b/contrib/bloom/bloom.h @@ -190,7 +190,7 @@ extern bool blinsert(Relation index, Datum *values, bool *isnull, IndexUniqueCheck checkUnique, bool indexUnchanged, struct IndexInfo *indexInfo); -extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys); +extern IndexScanDesc blbeginscan(Relation heap, Relation index, int nkeys, int norderbys); extern int64 blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm); extern void blrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, ScanKey orderbys, int norderbys); diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c index d072f47fe28..ba84bf6a0c5 100644 --- a/contrib/bloom/blscan.c +++ b/contrib/bloom/blscan.c @@ -22,12 +22,12 @@ * Begin scan of bloom index. */ IndexScanDesc -blbeginscan(Relation r, int nkeys, int norderbys) +blbeginscan(Relation heap, Relation index, int nkeys, int norderbys) { IndexScanDesc scan; BloomScanOpaque so; - scan = RelationGetIndexScan(r, nkeys, norderbys); + scan = RelationGetIndexScan(index, nkeys, norderbys); so = (BloomScanOpaque) palloc(sizeof(BloomScanOpaqueData)); initBloomState(&so->state, scan->indexRelation); diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 1aa4741a8ea..b15bb241f5b 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -657,7 +657,8 @@ amadjustmembers (Oid opfamilyoid, <para> <programlisting> IndexScanDesc -ambeginscan (Relation indexRelation, +ambeginscan (Relation heapRelation, + Relation indexRelation, int nkeys, int norderbys); </programlisting> @@ -674,6 +675,38 @@ ambeginscan (Relation indexRelation, the interesting parts of index-scan startup are in <function>amrescan</function>. </para> + <para> + The index scan may opt into asynchronous I/O by initializing a <literal>ReadStream</literal> + on the provided <literal>heapRelation</literal>, and storing it in <literal>xs_rs</literal> + field of the scan descriptor. If the field is left <literal>NULL</literal>, the synchronous + buffer API will be used. 
The <literal>heapRelation</literal> is left <literal>NULL</literal> + for bitmap scans, which access it from a separate node. + </para> + + <para> + To create the <literal>ReadStream</literal>, the index has to implement a <literal>read_next</literal> + callback, returning a sequence of block numbers. The scan has to be split into multiple + partial sequences (e.g. one sequence per leaf page); the index has to reset the stream + when advancing to the next leaf page. + </para> + + <para> + If the index supports index-only scans, it needs to set the <literal>xs_visible</literal> + field when returning an item from <literal>amgettuple</literal>. This value has + to be determined in the <literal>read_next</literal> callback and remembered, + and the callback must not queue all-visible blocks. Otherwise the queued blocks + and read blocks might disagree. The VM may be updated at any point, so a block + might be queued, but then not read as it's all-visible. Or vice versa. Regular + (non-IOS) scans don't need to worry about this. + </para> + + <para> + The <literal>read_next</literal> callback must not queue runs of the same block. + If the block number is the same as the last returned block, it has to be + skipped. If the stream is reset (e.g. when advancing to the next leaf + page), the last block is forgotten. + </para> + <para> <programlisting> void diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 4204088fa0d..31222e5a96d 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -536,16 +536,16 @@ brininsertcleanup(Relation index, IndexInfo *indexInfo) * holding lock on index, it's not necessary to recompute it during brinrescan. */ IndexScanDesc -brinbeginscan(Relation r, int nkeys, int norderbys) +brinbeginscan(Relation heap, Relation index, int nkeys, int norderbys) { IndexScanDesc scan; BrinOpaque *opaque; - scan = RelationGetIndexScan(r, nkeys, norderbys); + scan = RelationGetIndexScan(index, nkeys, norderbys); opaque = palloc_object(BrinOpaque); - opaque->bo_rmAccess = brinRevmapInitialize(r, &opaque->bo_pagesPerRange); - opaque->bo_bdesc = brin_build_desc(r); + opaque->bo_rmAccess = brinRevmapInitialize(index, &opaque->bo_pagesPerRange); + opaque->bo_bdesc = brin_build_desc(index); scan->opaque = opaque; return scan; diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c index c2d1771bd77..4e11ed9626d 100644 --- a/src/backend/access/gin/ginscan.c +++ b/src/backend/access/gin/ginscan.c @@ -22,7 +22,7 @@ IndexScanDesc -ginbeginscan(Relation rel, int nkeys, int norderbys) +ginbeginscan(Relation heap, Relation index, int nkeys, int norderbys) { IndexScanDesc scan; GinScanOpaque so; @@ -30,7 +30,7 @@ ginbeginscan(Relation rel, int nkeys, int norderbys) /* no order by operators allowed */ Assert(norderbys == 0); - scan = RelationGetIndexScan(rel, nkeys, norderbys); + scan = RelationGetIndexScan(index, nkeys, norderbys); /* allocate private workspace */ so = (GinScanOpaque) palloc(sizeof(GinScanOpaqueData)); diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c index 700fa959d03..d8ba7f7eff5 100644 --- a/src/backend/access/gist/gistscan.c +++ b/src/backend/access/gist/gistscan.c @@ -71,14 +71,14 @@ pairingheap_GISTSearchItem_cmp(const pairingheap_node *a, const pairingheap_node */ IndexScanDesc -gistbeginscan(Relation r, int nkeys, int norderbys) +gistbeginscan(Relation heap, Relation index, int nkeys, int norderbys) { IndexScanDesc scan; GISTSTATE *giststate; GISTScanOpaque
so; MemoryContext oldCxt; - scan = RelationGetIndexScan(r, nkeys, norderbys); + scan = RelationGetIndexScan(index, nkeys, norderbys); /* First, set up a GISTSTATE with a scan-lifespan memory context */ giststate = initGISTstate(scan->indexRelation); diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 53061c819fb..2133e454e9b 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -371,7 +371,7 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm) * hashbeginscan() -- start a scan on a hash index */ IndexScanDesc -hashbeginscan(Relation rel, int nkeys, int norderbys) +hashbeginscan(Relation heap, Relation index, int nkeys, int norderbys) { IndexScanDesc scan; HashScanOpaque so; @@ -379,7 +379,7 @@ hashbeginscan(Relation rel, int nkeys, int norderbys) /* no order by operators allowed */ Assert(norderbys == 0); - scan = RelationGetIndexScan(rel, nkeys, norderbys); + scan = RelationGetIndexScan(index, nkeys, norderbys); so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData)); HashScanPosInvalidate(so->currPos); diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index cb4bc35c93e..a16aa3e56ae 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -79,13 +79,16 @@ heapam_slot_callbacks(Relation relation) */ static IndexFetchTableData * -heapam_index_fetch_begin(Relation rel) +heapam_index_fetch_begin(Relation rel, ReadStream *rs) { IndexFetchHeapData *hscan = palloc0(sizeof(IndexFetchHeapData)); hscan->xs_base.rel = rel; hscan->xs_cbuf = InvalidBuffer; + /* XXX Maybe the stream should be in IndexFetchHeapData instead? */ + hscan->xs_base.rs = rs; + return &hscan->xs_base; } @@ -129,16 +132,99 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, { /* Switch to correct buffer if we don't have it already */ Buffer prev_buf = hscan->xs_cbuf; + bool release_prev_buf = true; + + /* + * XXX We should compare the previous/this block, and only do the read + * if the blocks are different (and reuse the buffer otherwise). But the + * index AMs would need to do exactly the same thing, to keep both sides + * of the queue in sync. + */ + + /* + * If the scan is using read stream, get the block from it. If not, + * use the regular buffer read. + */ + if (scan->rs) + { + /* + * If we're trying to read the same block as the last time, don't + * try reading it from the stream again, but just return the last + * buffer. We need to check if the previous buffer is still pinned + * and contains the correct block (it might have been unpinned, + * used for a different block, so we need to be careful). + * + * The places scheduling the blocks (read_next callbacks) need to + * do the same thing and not schedule the blocks if it matches the + * previous one. Otherwise the stream will get out of sync, causing + * confusion. + * + * This is what ReleaseAndReadBuffer does too, but it does not + * have a queue of requests scheduled from somewhere else, so it + * does not need to worry about that. + * + * XXX Maybe we should remember the block in IndexFetchTableData, + * so that we can make the check even cheaper, without looking at + * the buffer descriptor? But that assumes the buffer was not + * unpinned (or repinned) elsewhere, before we got back here. But + * can that even happen? If yes, I guess we shouldn't be releasing + * the prev buffer anyway. + * + * XXX This has undesired impact on prefetch distance. 
The read + * stream schedules reads for a certain number of future blocks, + * but if we skip duplicate blocks, the prefetch distance may get + * unexpectedly large (e.g. for correlated indexes, with long runs + * of TIDs from the same heap page). We're however limited to items + * from a single leaf page. + * + * XXX What if we pinned the buffer twice (increase the refcount), + * so that if the caller unpins the buffer, we still keep the + * second pin. Wouldn't that mean we don't need to worry about the + * possibility someone loaded another page into the buffer? + * + * XXX We might also keep a longer history of recent blocks, not + * just the immediately preceding one. But that makes it harder, + * because the two places (read_next callback and here) need to + * have a slightly different view. + */ + if (BufferMatches(hscan->xs_cbuf, + hscan->xs_base.rel, + ItemPointerGetBlockNumber(tid))) + release_prev_buf = false; + else + hscan->xs_cbuf = read_stream_next_buffer(scan->rs, NULL); + } + else + hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf, + hscan->xs_base.rel, + ItemPointerGetBlockNumber(tid)); - hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf, - hscan->xs_base.rel, - ItemPointerGetBlockNumber(tid)); + /* We should always get a valid buffer for a valid TID. */ + Assert(BufferIsValid(hscan->xs_cbuf)); + + /* + * Did we read the expected block number (per the TID)? + * + * For the regular buffer reads this should always match, but with the + * read stream it might disagree due to a bug / subtle difference in the + * read_next callback. + */ + Assert(BufferGetBlockNumber(hscan->xs_cbuf) == ItemPointerGetBlockNumber(tid)); /* * Prune page, but only if we weren't already on this page */ if (prev_buf != hscan->xs_cbuf) heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf); + + /* + * The read stream does not release the buffer, the caller is expected + * to do that (unlike ReleaseAndReadBuffer). But that would mean the + * behavior with/without read stream is different, and the contract for + * index_fetch_tuple would change. So we release the old buffer here.
+ */ + if (scan->rs && (prev_buf != InvalidBuffer) && release_prev_buf) + ReleaseBuffer(prev_buf); } /* Obtain share-lock on the buffer so we can examine visibility */ diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index 0cb27af1310..10dc832d4f7 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -125,6 +125,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys) scan->xs_itupdesc = NULL; scan->xs_hitup = NULL; scan->xs_hitupdesc = NULL; + scan->xs_rs = NULL; return scan; } diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index 219df1971da..73b531a9eff 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -104,7 +104,7 @@ do { \ CppAsString(pname), RelationGetRelationName(scan->indexRelation)); \ } while(0) -static IndexScanDesc index_beginscan_internal(Relation indexRelation, +static IndexScanDesc index_beginscan_internal(Relation heapRelation, Relation indexRelation, int nkeys, int norderbys, Snapshot snapshot, ParallelIndexScanDesc pscan, bool temp_snap); static inline void validate_relation_kind(Relation r); @@ -263,7 +263,8 @@ index_beginscan(Relation heapRelation, Assert(snapshot != InvalidSnapshot); - scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false); + scan = index_beginscan_internal(heapRelation, indexRelation, + nkeys, norderbys, snapshot, NULL, false); /* * Save additional parameters into the scandesc. Everything else was set @@ -274,7 +275,7 @@ index_beginscan(Relation heapRelation, scan->instrument = instrument; /* prepare to fetch index matches from table */ - scan->xs_heapfetch = table_index_fetch_begin(heapRelation); + scan->xs_heapfetch = table_index_fetch_begin(heapRelation, scan->xs_rs); return scan; } @@ -295,7 +296,7 @@ index_beginscan_bitmap(Relation indexRelation, Assert(snapshot != InvalidSnapshot); - scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false); + scan = index_beginscan_internal(NULL, indexRelation, nkeys, 0, snapshot, NULL, false); /* * Save additional parameters into the scandesc. Everything else was set @@ -311,7 +312,7 @@ index_beginscan_bitmap(Relation indexRelation, * index_beginscan_internal --- common code for index_beginscan variants */ static IndexScanDesc -index_beginscan_internal(Relation indexRelation, +index_beginscan_internal(Relation heapRelation, Relation indexRelation, int nkeys, int norderbys, Snapshot snapshot, ParallelIndexScanDesc pscan, bool temp_snap) { @@ -331,8 +332,8 @@ index_beginscan_internal(Relation indexRelation, /* * Tell the AM to open a scan. */ - scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, - norderbys); + scan = indexRelation->rd_indam->ambeginscan(heapRelation, indexRelation, + nkeys, norderbys); /* Initialize information for parallel scan. */ scan->parallel_scan = pscan; scan->xs_temp_snap = temp_snap; @@ -593,8 +594,8 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, snapshot = RestoreSnapshot(pscan->ps_snapshot_data); RegisterSnapshot(snapshot); - scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot, - pscan, true); + scan = index_beginscan_internal(heaprel, indexrel, nkeys, norderbys, + snapshot, pscan, true); /* * Save additional parameters into the scandesc. 
Everything else was set @@ -605,7 +606,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, scan->instrument = instrument; /* prepare to fetch index matches from table */ - scan->xs_heapfetch = table_index_fetch_begin(heaprel); + scan->xs_heapfetch = table_index_fetch_begin(heaprel, scan->xs_rs); return scan; } diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index fdff960c130..619b356e848 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -333,7 +333,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm) * btbeginscan() -- start a scan on a btree index */ IndexScanDesc -btbeginscan(Relation rel, int nkeys, int norderbys) +btbeginscan(Relation heap, Relation index, int nkeys, int norderbys) { IndexScanDesc scan; BTScanOpaque so; @@ -342,7 +342,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys) Assert(norderbys == 0); /* get the scan */ - scan = RelationGetIndexScan(rel, nkeys, norderbys); + scan = RelationGetIndexScan(index, nkeys, norderbys); /* allocate private workspace */ so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData)); @@ -371,7 +371,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys) */ so->currTuples = so->markTuples = NULL; - scan->xs_itupdesc = RelationGetDescr(rel); + scan->xs_itupdesc = RelationGetDescr(index); scan->opaque = so; diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c index 25893050c58..655f5cdc1eb 100644 --- a/src/backend/access/spgist/spgscan.c +++ b/src/backend/access/spgist/spgscan.c @@ -301,13 +301,13 @@ spgPrepareScanKeys(IndexScanDesc scan) } IndexScanDesc -spgbeginscan(Relation rel, int keysz, int orderbysz) +spgbeginscan(Relation heap, Relation index, int keysz, int orderbysz) { IndexScanDesc scan; SpGistScanOpaque so; int i; - scan = RelationGetIndexScan(rel, keysz, orderbysz); + scan = RelationGetIndexScan(index, keysz, orderbysz); so = (SpGistScanOpaque) palloc0(sizeof(SpGistScanOpaqueData)); if (keysz > 0) @@ -330,7 +330,7 @@ spgbeginscan(Relation rel, int keysz, int orderbysz) * most opclasses we can re-use the index reldesc instead of making one.) 
*/ so->reconTupDesc = scan->xs_hitupdesc = - getSpGistTupleDesc(rel, &so->state.attType); + getSpGistTupleDesc(index, &so->state.attType); /* Allocate various arrays needed for order-by scans */ if (scan->numberOfOrderBys > 0) @@ -362,14 +362,14 @@ spgbeginscan(Relation rel, int keysz, int orderbysz) } fmgr_info_copy(&so->innerConsistentFn, - index_getprocinfo(rel, 1, SPGIST_INNER_CONSISTENT_PROC), + index_getprocinfo(index, 1, SPGIST_INNER_CONSISTENT_PROC), CurrentMemoryContext); fmgr_info_copy(&so->leafConsistentFn, - index_getprocinfo(rel, 1, SPGIST_LEAF_CONSISTENT_PROC), + index_getprocinfo(index, 1, SPGIST_LEAF_CONSISTENT_PROC), CurrentMemoryContext); - so->indexCollation = rel->rd_indcollation[0]; + so->indexCollation = index->rd_indcollation[0]; scan->opaque = so; diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c index a56c5eceb14..be8e02a9c45 100644 --- a/src/backend/access/table/tableam.c +++ b/src/backend/access/table/tableam.c @@ -217,7 +217,7 @@ table_index_fetch_tuple_check(Relation rel, bool found; slot = table_slot_create(rel, NULL); - scan = table_index_fetch_begin(rel); + scan = table_index_fetch_begin(rel, NULL); found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again, all_dead); table_index_fetch_end(scan); diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c index 3497a8221f2..31279bd82b1 100644 --- a/src/backend/commands/constraint.c +++ b/src/backend/commands/constraint.c @@ -106,7 +106,8 @@ unique_key_recheck(PG_FUNCTION_ARGS) */ tmptid = checktid; { - IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation); + IndexFetchTableData *scan + = table_index_fetch_begin(trigdata->tg_relation, NULL); bool call_again = false; if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot, diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c index f464cca9507..ffebdfa7abd 100644 --- a/src/backend/executor/nodeIndexonlyscan.c +++ b/src/backend/executor/nodeIndexonlyscan.c @@ -157,10 +157,14 @@ IndexOnlyNext(IndexOnlyScanState *node) * * It's worth going through this complexity to avoid needing to lock * the VM buffer, which could cause significant contention. + * + * XXX This expects the index AM to set the xs_visible flag. Maybe we + * should not assume that. The subsequent patches do that for all the + * built in AMs, but until that time it's effectively broken - the AMs + * currently do not set xs_visible. Maybe we should only use this if + * the AM uses ReadStream, and call VM_ALL_VISIBLE otherwise? */ - if (!VM_ALL_VISIBLE(scandesc->heapRelation, - ItemPointerGetBlockNumber(tid), - &node->ioss_VMBuffer)) + if (!scandesc->xs_visible) { /* * Rats, we have to visit the heap to check visibility. 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 1f04a2c182c..f030a34cc62 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -144,6 +144,7 @@ int max_parallel_workers_per_gather = 2; bool enable_seqscan = true; bool enable_indexscan = true; +bool enable_indexscan_prefetch = true; bool enable_indexonlyscan = true; bool enable_bitmapscan = true; bool enable_tidscan = true; diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index bd68d7e0ca9..2a69d6c3ec8 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3045,6 +3045,46 @@ ReleaseAndReadBuffer(Buffer buffer, return ReadBuffer(relation, blockNum); } +/* + * BufferMatches + * Check if the buffer (still) contains the expected page. + * + * Check if the buffer contains the expected page. The buffer may be invalid, + * or valid and pinned. + */ +bool +BufferMatches(Buffer buffer, + Relation relation, + BlockNumber blockNum) +{ + ForkNumber forkNum = MAIN_FORKNUM; + BufferDesc *bufHdr; + + if (BufferIsValid(buffer)) + { + Assert(BufferIsPinned(buffer)); + if (BufferIsLocal(buffer)) + { + bufHdr = GetLocalBufferDescriptor(-buffer - 1); + if (bufHdr->tag.blockNum == blockNum && + BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) && + BufTagGetForkNum(&bufHdr->tag) == forkNum) + return true; + } + else + { + bufHdr = GetBufferDescriptor(buffer - 1); + /* we have pin, so it's ok to examine tag without spinlock */ + if (bufHdr->tag.blockNum == blockNum && + BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) && + BufTagGetForkNum(&bufHdr->tag) == forkNum) + return true; + } + } + + return false; +} + /* * PinBuffer -- make buffer unavailable for replacement. 
* diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 511dc32d519..7aa782aa070 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -809,6 +809,16 @@ struct config_bool ConfigureNamesBool[] = true, NULL, NULL, NULL }, + { + {"enable_indexscan_prefetch", PGC_USERSET, QUERY_TUNING_METHOD, + gettext_noop("Enables prefetching for index-scans."), + NULL, + GUC_EXPLAIN + }, + &enable_indexscan_prefetch, + true, + NULL, NULL, NULL + }, { {"enable_indexonlyscan", PGC_USERSET, QUERY_TUNING_METHOD, gettext_noop("Enables the planner's use of index-only-scan plans."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 341f88adc87..c048bb03f30 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -412,6 +412,7 @@ #enable_incremental_sort = on #enable_indexscan = on #enable_indexonlyscan = on +#enable_indexscan_prefetch = on #enable_material = on #enable_memoize = on #enable_mergejoin = on diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h index 52916bab7a3..1ef3daee885 100644 --- a/src/include/access/amapi.h +++ b/src/include/access/amapi.h @@ -181,7 +181,8 @@ typedef void (*amadjustmembers_function) (Oid opfamilyoid, List *functions); /* prepare for index scan */ -typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation, +typedef IndexScanDesc (*ambeginscan_function) (Relation heapRelation, + Relation indexRelation, int nkeys, int norderbys); diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h index d093a0bf130..186b4b43f72 100644 --- a/src/include/access/brin_internal.h +++ b/src/include/access/brin_internal.h @@ -97,7 +97,8 @@ extern bool brininsert(Relation idxRel, Datum *values, bool *nulls, bool indexUnchanged, struct IndexInfo *indexInfo); extern void brininsertcleanup(Relation index, struct IndexInfo *indexInfo); -extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys); +extern IndexScanDesc brinbeginscan(Relation heap, Relation index, + int nkeys, int norderbys); extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm); extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, ScanKey orderbys, int norderbys); diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h index aee1f70c22e..45f19594c79 100644 --- a/src/include/access/gin_private.h +++ b/src/include/access/gin_private.h @@ -393,7 +393,8 @@ typedef struct GinScanOpaqueData typedef GinScanOpaqueData *GinScanOpaque; -extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys); +extern IndexScanDesc ginbeginscan(Relation heap, Relation index, + int nkeys, int norderbys); extern void ginendscan(IndexScanDesc scan); extern void ginrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, ScanKey orderbys, int norderbys); diff --git a/src/include/access/gistscan.h b/src/include/access/gistscan.h index 518034c36d5..372f8be5817 100644 --- a/src/include/access/gistscan.h +++ b/src/include/access/gistscan.h @@ -16,7 +16,8 @@ #include "access/amapi.h" -extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys); +extern IndexScanDesc gistbeginscan(Relation heap, Relation index, + int nkeys, int norderbys); extern void gistrescan(IndexScanDesc scan, ScanKey key, int nkeys, ScanKey orderbys, int norderbys); extern void gistendscan(IndexScanDesc scan); diff --git a/src/include/access/hash.h 
b/src/include/access/hash.h index 073ad29b19b..6befa3ebf60 100644 --- a/src/include/access/hash.h +++ b/src/include/access/hash.h @@ -370,7 +370,8 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull, struct IndexInfo *indexInfo); extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir); extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm); -extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys); +extern IndexScanDesc hashbeginscan(Relation heap, Relation index, + int nkeys, int norderbys); extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, ScanKey orderbys, int norderbys); extern void hashendscan(IndexScanDesc scan); diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index e709d2e0afe..e6e52210b15 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -1188,7 +1188,8 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull, IndexUniqueCheck checkUnique, bool indexUnchanged, struct IndexInfo *indexInfo); -extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys); +extern IndexScanDesc btbeginscan(Relation heap, Relation index, + int nkeys, int norderbys); extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys); extern void btinitparallelscan(void *target); extern bool btgettuple(IndexScanDesc scan, ScanDirection dir); diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index b5e0fb386c0..56e6c6245e5 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -19,6 +19,7 @@ #include "nodes/tidbitmap.h" #include "port/atomics.h" #include "storage/buf.h" +#include "storage/read_stream.h" #include "storage/relfilelocator.h" #include "storage/spin.h" #include "utils/relcache.h" @@ -121,6 +122,7 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker; typedef struct IndexFetchTableData { Relation rel; + ReadStream *rs; } IndexFetchTableData; struct IndexScanInstrumentation; @@ -168,11 +170,13 @@ typedef struct IndexScanDescData struct TupleDescData *xs_itupdesc; /* rowtype descriptor of xs_itup */ HeapTuple xs_hitup; /* index data returned by AM, as HeapTuple */ struct TupleDescData *xs_hitupdesc; /* rowtype descriptor of xs_hitup */ + bool xs_visible; /* heap page is all-visible */ ItemPointerData xs_heaptid; /* result */ bool xs_heap_continue; /* T if must keep walking, potential * further results */ IndexFetchTableData *xs_heapfetch; + ReadStream *xs_rs; /* read_stream (if supported by the AM) */ bool xs_recheck; /* T means scan keys must be rechecked */ diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h index cbe9b347d8f..69588d18124 100644 --- a/src/include/access/spgist.h +++ b/src/include/access/spgist.h @@ -203,7 +203,8 @@ extern bool spginsert(Relation index, Datum *values, bool *isnull, struct IndexInfo *indexInfo); /* spgscan.c */ -extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz); +extern IndexScanDesc spgbeginscan(Relation heap, Relation index, + int keysz, int orderbysz); extern void spgendscan(IndexScanDesc scan); extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, ScanKey orderbys, int norderbys); diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 1c9e802a6b1..749a68ed861 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -413,8 +413,14 @@ typedef struct TableAmRoutine * structure with additional information. 
* * Tuples for an index scan can then be fetched via index_fetch_tuple. + * + * The 'rs' parameter is the read stream initialized by the AM, in + * which case the read stream has to be used to fetch tuples. If the + * AM does not support read stream, it's set to NULL and the regular + * synchronous API to read buffers shall be used. */ - struct IndexFetchTableData *(*index_fetch_begin) (Relation rel); + struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, + ReadStream *rs); /* * Reset index fetch. Typically this will release cross index fetch @@ -1149,9 +1155,9 @@ table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan) * Tuples for an index scan can then be fetched via table_index_fetch_tuple(). */ static inline IndexFetchTableData * -table_index_fetch_begin(Relation rel) +table_index_fetch_begin(Relation rel, ReadStream *rs) { - return rel->rd_tableam->index_fetch_begin(rel); + return rel->rd_tableam->index_fetch_begin(rel, rs); } /* diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h index b523bcda8f3..00f4c3d0011 100644 --- a/src/include/optimizer/cost.h +++ b/src/include/optimizer/cost.h @@ -51,6 +51,7 @@ extern PGDLLIMPORT Cost disable_cost; extern PGDLLIMPORT int max_parallel_workers_per_gather; extern PGDLLIMPORT bool enable_seqscan; extern PGDLLIMPORT bool enable_indexscan; +extern PGDLLIMPORT bool enable_indexscan_prefetch; extern PGDLLIMPORT bool enable_indexonlyscan; extern PGDLLIMPORT bool enable_bitmapscan; extern PGDLLIMPORT bool enable_tidscan; diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index 41fdc1e7693..3b7d4e6a6a2 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -237,6 +237,8 @@ extern void IncrBufferRefCount(Buffer buffer); extern void CheckBufferIsPinnedOnce(Buffer buffer); extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation, BlockNumber blockNum); +extern bool BufferMatches(Buffer buffer, Relation relation, + BlockNumber blockNum); extern Buffer ExtendBufferedRel(BufferManagerRelation bmr, ForkNumber forkNum, diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c index 94ef639b6fc..622a8ed0757 100644 --- a/src/test/modules/dummy_index_am/dummy_index_am.c +++ b/src/test/modules/dummy_index_am/dummy_index_am.c @@ -241,12 +241,12 @@ divalidate(Oid opclassoid) * Begin scan of index AM. */ static IndexScanDesc -dibeginscan(Relation r, int nkeys, int norderbys) +dibeginscan(Relation heap, Relation index, int nkeys, int norderbys) { IndexScanDesc scan; /* Let's pretend we are doing something */ - scan = RelationGetIndexScan(r, nkeys, norderbys); + scan = RelationGetIndexScan(index, nkeys, norderbys); return scan; } diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out index 83228cfca29..3a7603b24e2 100644 --- a/src/test/regress/expected/sysviews.out +++ b/src/test/regress/expected/sysviews.out @@ -158,6 +158,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_incremental_sort | on enable_indexonlyscan | on enable_indexscan | on + enable_indexscan_prefetch | on enable_material | on enable_memoize | on enable_mergejoin | on @@ -172,7 +173,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_seqscan | on enable_sort | on enable_tidscan | on -(24 rows) +(25 rows) -- There are always wait event descriptions for various types. 
InjectionPoint -- may be present or absent, depending on history since last postmaster start. -- 2.50.0
From 93d7675cb5f56629b948e3e186de7015d67c7070 Mon Sep 17 00:00:00 2001 From: Tomas Vondra <to...@vondra.me> Date: Thu, 1 May 2025 20:23:18 +0200 Subject: [PATCH v20250709 2/6] prefetch for btree indexes Implements the bt_stream_read_next() callback, returning blocks from the current BTScanOpaque. --- src/backend/access/nbtree/nbtree.c | 162 ++++++++++++++++++++++ src/backend/access/nbtree/nbtsearch.c | 46 ++++++++ src/include/access/nbtree.h | 9 ++ 3 files changed, 217 insertions(+) diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 619b356e848..0fa4af79dac 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -21,9 +21,11 @@ #include "access/nbtree.h" #include "access/relscan.h" #include "access/stratnum.h" +#include "access/visibilitymap.h" #include "commands/progress.h" #include "commands/vacuum.h" #include "nodes/execnodes.h" +#include "optimizer/cost.h" #include "pgstat.h" #include "storage/bulk_write.h" #include "storage/condition_variable.h" @@ -329,6 +331,107 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm) return ntids; } +/* + * bt_stream_read_next + * Return the next block to read from the read stream. + * + * Returns the next block from the current leaf page. The first block is + * requested when accessing the first tuple, after already receiving the TID + * from the index (for the item itemIndex points at). + * + * With index-only scans this skips all-visible pages. The visibility info + * is stored, so that we can later pass it to the scan (we must not access + * the VM again, the bit might have changed, and the read stream would get + * out of sync, i.e. we'd get different blocks than we expect). + * + * Returns the block number to get from the read stream. InvalidBlockNumber + * means we've run out of items on the current leaf page - the stream will + * end, and we'll need to reset it after reading the next page (or after + * changing the scan direction). + * + * XXX Should skip duplicate blocks (for correlated indexes). Runs of the + * same block are already skipped via lastBlock below. + */ +static BlockNumber +bt_stream_read_next(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + BTScanOpaque so = (BTScanOpaque) scan->opaque; + ScanDirection dir = so->currPos.dir; + BlockNumber block = InvalidBlockNumber; + + /* + * Is this the first request for the read stream (possibly after a reset)? + * If yes, initialize the stream to the current item (itemIndex). + */ + if (so->currPos.streamIndex == -1) + so->currPos.streamIndex = so->currPos.itemIndex; + + /* + * Find the next block to read. For plain index scans we will return the + * very next item, but with index-only scans we skip TIDs from all-visible + * pages (because we won't read those). + */ + while ((so->currPos.streamIndex >= so->currPos.firstItem) && + (so->currPos.streamIndex <= so->currPos.lastItem)) + { + ItemPointer tid; + BTScanPosItem *item; + + item = &so->currPos.items[so->currPos.streamIndex]; + + tid = &item->heapTid; + block = ItemPointerGetBlockNumber(tid); + + /* + * For index-only scans, check the VM and remember the result. If the page + * is all-visible, don't return the block number, try reading the next one. + * + * XXX Maybe this could use the same logic to check for duplicate blocks, + * and reuse the VM result if possible.
+ */ + if (scan->xs_want_itup) + { + if (!item->allVisibleSet) + { + item->allVisibleSet = true; + item->allVisible = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(tid), + &so->vmBuffer); + } + + /* don't prefetch this all-visible block, try the next one */ + if (item->allVisible) + block = InvalidBlockNumber; + } + + /* advance to the next item, assuming the current scan direction */ + if (ScanDirectionIsForward(dir)) + { + so->currPos.streamIndex++; + } + else + { + so->currPos.streamIndex--; + } + + /* don't return the same block twice (and remember this one) */ + if (so->lastBlock == block) + block = InvalidBlockNumber; + + /* Did we find a valid block? If yes, we're done. */ + if (block != InvalidBlockNumber) + break; + } + + /* remember the block we're returning */ + so->lastBlock = block; + + return block; +} + /* * btbeginscan() -- start a scan on a btree index */ @@ -364,6 +467,12 @@ btbeginscan(Relation heap, Relation index, int nkeys, int norderbys) so->killedItems = NULL; /* until needed */ so->numKilled = 0; + /* buffer for accessing the VM in read_next callback */ + so->vmBuffer = InvalidBuffer; + + /* nothing returned */ + so->lastBlock = InvalidBlockNumber; + /* * We don't know yet whether the scan will be index-only, so we do not * allocate the tuple workspace arrays until btrescan. However, we set up @@ -375,6 +484,27 @@ btbeginscan(Relation heap, Relation index, int nkeys, int norderbys) scan->opaque = so; + /* + * Initialize the read stream too, to opt into prefetching. + * + * XXX We create a stream depending on the GUC, and only if the heap rel + * is provided. This means we don't initialize the stream even for bitmap + * scans, which don't use it. + * + * XXX The table has to be already locked by the query, so NoLock. Too + * bad the heapRelation does not get passed here. + */ + if (enable_indexscan_prefetch && heap) + { + scan->xs_rs = read_stream_begin_relation(READ_STREAM_DEFAULT, + NULL, + heap, + MAIN_FORKNUM, + bt_stream_read_next, + scan, + 0); + } + return scan; } @@ -461,6 +591,14 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData)); so->numberOfKeys = 0; /* until _bt_preprocess_keys sets it */ so->numArrayKeys = 0; /* ditto */ + + /* reset the read stream, to start over */ + if (scan->xs_rs) + { + so->currPos.streamIndex = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } } /* @@ -483,6 +621,22 @@ btendscan(IndexScanDesc scan) so->markItemIndex = -1; BTScanPosUnpinIfPinned(so->markPos); + /* release the VM buffer */ + if (so->vmBuffer != InvalidBuffer) + { + ReleaseBuffer(so->vmBuffer); + so->vmBuffer = InvalidBuffer; + } + + /* + * XXX I wonder if maybe the stream should be managed by the indexam.c + * layer, not by each index AM. + */ + + /* terminate the read stream (and close the heap) */ + if (scan->xs_rs) + read_stream_end(scan->xs_rs); + /* No need to invalidate positions, the RAM is about to be freed.
*/ /* Release storage */ @@ -581,6 +735,14 @@ btrestrpos(IndexScanDesc scan) else BTScanPosInvalidate(so->currPos); } + + /* we've restored the scan position, reset the read stream */ + if (scan->xs_rs) + { + so->currPos.streamIndex = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } } /* diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index 4af1ff1e9e5..91e4fe5ec95 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -17,6 +17,7 @@ #include "access/nbtree.h" #include "access/relscan.h" +#include "access/visibilitymap.h" #include "access/xact.h" #include "miscadmin.h" #include "pgstat.h" @@ -2045,6 +2046,21 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, */ Assert(!pstate.forcenonrequired); + /* + * Reset the read stream, to restart it for the new page. + * + * XXX Maybe we should not reset prefetch distance to 0, but start from + * a somewhat higher value. We're merely continuing the same scan as + * before ... maybe reduce it a bit, to not harm LIMIT queries, but not + * reset it all the way to 0. + */ + if (scan->xs_rs) + { + so->currPos.streamIndex = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } + return (so->currPos.firstItem <= so->currPos.lastItem); } @@ -2059,6 +2075,11 @@ _bt_saveitem(BTScanOpaque so, int itemIndex, currItem->heapTid = itup->t_tid; currItem->indexOffset = offnum; + + /* initialize visibility flags */ + currItem->allVisibleSet = false; + currItem->allVisible = false; + if (so->currTuples) { Size itupsz = IndexTupleSize(itup); @@ -2089,6 +2110,11 @@ _bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum, currItem->heapTid = *heapTid; currItem->indexOffset = offnum; + + /* initialize visibility flags */ + currItem->allVisibleSet = false; + currItem->allVisible = false; + if (so->currTuples) { /* Save base IndexTuple (truncate posting list) */ @@ -2126,6 +2152,10 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum, currItem->heapTid = *heapTid; currItem->indexOffset = offnum; + /* initialize visibility flags */ + currItem->allVisibleSet = false; + currItem->allVisible = false; + /* * Have index-only scans return the same base IndexTuple for every TID * that originates from the same posting list @@ -2150,6 +2180,22 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so) /* Return next item, per amgettuple contract */ scan->xs_heaptid = currItem->heapTid; + + /* + * XXX If this is an index-only scan and we haven't checked the VM yet, do + * that now. We need to make sure scan->xs_visible is set correctly even + * if the scan is not using a read stream.
+ */ + if (scan->xs_want_itup && !currItem->allVisibleSet) + { + currItem->allVisibleSet = true; + currItem->allVisible + = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(&currItem->heapTid), + &so->vmBuffer); + } + + scan->xs_visible = currItem->allVisible; if (so->currTuples) scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset); } diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index e6e52210b15..e307fd9bf7f 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -957,6 +957,8 @@ typedef struct BTScanPosItem /* what we remember about each match */ ItemPointerData heapTid; /* TID of referenced heap item */ OffsetNumber indexOffset; /* index item's location within page */ LocationIndex tupleOffset; /* IndexTuple's offset in workspace, if any */ + bool allVisibleSet; /* did we set the VM flag already? */ + bool allVisible; /* VM info (for IOS) */ } BTScanPosItem; typedef struct BTScanPosData @@ -995,6 +997,7 @@ typedef struct BTScanPosData int firstItem; /* first valid index in items[] */ int lastItem; /* last valid index in items[] */ int itemIndex; /* current index in items[] */ + int streamIndex; /* item returned to the read stream */ BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */ } BTScanPosData; @@ -1067,6 +1070,9 @@ typedef struct BTScanOpaqueData FmgrInfo *orderProcs; /* ORDER procs for required equality keys */ MemoryContext arrayContext; /* scan-lifespan context for array data */ + /* buffer for accessing VM in index-only scans */ + Buffer vmBuffer; + /* info about killed items if any (killedItems is NULL if never used) */ int *killedItems; /* currPos.items indexes of killed items */ int numKilled; /* number of currently stored items */ @@ -1089,6 +1095,9 @@ typedef struct BTScanOpaqueData */ int markItemIndex; /* itemIndex, or -1 if not valid */ + /* last block returned by the read_next stream callback */ + BlockNumber lastBlock; + /* keep these last in struct for efficiency */ BTScanPosData currPos; /* current position data */ BTScanPosData markPos; /* marked position, if any */ -- 2.50.0
From 425ccad723c34ef8eeeb1d6d80400b1705c4ec2a Mon Sep 17 00:00:00 2001 From: Tomas Vondra <to...@vondra.me> Date: Fri, 25 Apr 2025 12:34:15 +0200 Subject: [PATCH v20250709 3/6] prefetch for hash indexes Implements the hash_stream_read_next() callback, returning blocks from HashScanOpaque. --- src/backend/access/hash/hash.c | 105 +++++++++++++++++++++++++++ src/backend/access/hash/hashsearch.c | 37 ++++++++++ src/include/access/hash.h | 8 ++ 3 files changed, 150 insertions(+) diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 2133e454e9b..0884f0e05d9 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -22,12 +22,14 @@ #include "access/hash_xlog.h" #include "access/relscan.h" #include "access/stratnum.h" +#include "access/table.h" #include "access/tableam.h" #include "access/xloginsert.h" #include "commands/progress.h" #include "commands/vacuum.h" #include "miscadmin.h" #include "nodes/execnodes.h" +#include "optimizer/cost.h" #include "optimizer/plancat.h" #include "pgstat.h" #include "utils/fmgrprotos.h" @@ -366,6 +368,78 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm) return ntids; } +/* + * hash_stream_read_next + * Return the next block to read from the read stream. + * + * Returns the next block from the current leaf page. The first block is + * requested when accessing the first tuple, after already receiving the TID + * from the index (for the item itemIndex points at). + * + * Returns the block number to get from the read stream. InvalidBlockNumber + * means we've run out of items on the current leaf page - the stream will + * end, and we'll need to reset it after reading the next page (or after + * changing the scan direction). + * + * XXX Should skip duplicate blocks (for correlated indexes). Runs of the + * same block are already skipped via lastBlock below. + */ +static BlockNumber +hash_stream_read_next(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + HashScanOpaque so = (HashScanOpaque) scan->opaque; + BlockNumber block = InvalidBlockNumber; + + /* + * Is this the first request for the read stream (possibly after a reset)? + * If yes, initialize the stream to the current item (itemIndex). + */ + if (so->currPos.streamIndex == -1) + so->currPos.streamIndex = so->currPos.itemIndex; + + /* + * Find the next block to read. For plain index scans we will return the + * very next item, but we might also skip duplicate blocks (in the future). + */ + while ((so->currPos.streamIndex >= so->currPos.firstItem) && + (so->currPos.streamIndex <= so->currPos.lastItem)) + { + ItemPointer tid; + HashScanPosItem *item; + + item = &so->currPos.items[so->currPos.streamIndex]; + + tid = &item->heapTid; + block = ItemPointerGetBlockNumber(tid); + + /* advance to the next item, depending on scan direction */ + if (ScanDirectionIsForward(so->currPos.dir)) + { + so->currPos.streamIndex++; + } + else + { + so->currPos.streamIndex--; + } + + /* don't return the same block twice (and remember this one) */ + if (so->lastBlock == block) + block = InvalidBlockNumber; + + /* Did we find a valid block? If yes, we're done.
*/ + if (block != InvalidBlockNumber) + break; + } + + /* remember the block we're returning */ + so->lastBlock = block; + + return block; +} + /* * hashbeginscan() -- start a scan on a hash index @@ -394,6 +468,25 @@ hashbeginscan(Relation heap, Relation index, int nkeys, int norderbys) scan->opaque = so; + /* nothing returned */ + so->lastBlock = InvalidBlockNumber; + + /* + * Initialize the read stream, to opt into prefetching. + * + * XXX See comments in btbeginscan(). + */ + if (enable_indexscan_prefetch && heap) + { + scan->xs_rs = read_stream_begin_relation(READ_STREAM_DEFAULT, + NULL, + heap, + MAIN_FORKNUM, + hash_stream_read_next, + scan, + 0); + } + return scan; } @@ -425,6 +518,14 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, so->hashso_buc_populated = false; so->hashso_buc_split = false; + + /* reset the stream, to start over */ + if (scan->xs_rs) + { + so->currPos.streamIndex = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } } /* @@ -449,6 +550,10 @@ hashendscan(IndexScanDesc scan) pfree(so->killedItems); pfree(so); scan->opaque = NULL; + + /* terminate read stream */ + if (scan->xs_rs) + read_stream_end(scan->xs_rs); } /* diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c index 92c15a65be2..d5b045deb8c 100644 --- a/src/backend/access/hash/hashsearch.c +++ b/src/backend/access/hash/hashsearch.c @@ -54,6 +54,28 @@ _hash_next(IndexScanDesc scan, ScanDirection dir) Buffer buf; bool end_of_scan = false; + /* + * We need to reset the read stream when the scan direction changes. Hash + * indexes are not ordered, but there are still scrollable cursors, and + * those do have a direction. So handle that here, and also remember the + * direction, so that the read_next callback can consider that. + * + * XXX we can't do that in the read_next callback, because we might have + * already hit the end of the stream (returned InvalidBlockNumber), in + * which case the callback won't be called. + */ + if (so->currPos.dir != dir) + { + so->currPos.dir = dir; + + if (scan->xs_rs) + { + so->currPos.streamIndex = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } + } + /* * Advance to the next tuple on the current page; or if done, try to read * data from the next or previous page based on the scan direction. Before @@ -592,6 +614,21 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) so->currPos.buf = InvalidBuffer; } + /* + * restart the stream for this page + * + * XXX Maybe we should not reset prefetch distance to 0, but start from + * a somewhat higher value. We're merely continuing the same scan as + * before ... maybe reduce it a bit, to not harm LIMIT queries, but not + * reset it all the way to 0. + */ + if (scan->xs_rs) + { + so->currPos.streamIndex = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } + Assert(so->currPos.firstItem <= so->currPos.lastItem); return true; } diff --git a/src/include/access/hash.h b/src/include/access/hash.h index 6befa3ebf60..3916cf746c6 100644 --- a/src/include/access/hash.h +++ b/src/include/access/hash.h @@ -113,6 +113,9 @@ typedef struct HashScanPosData BlockNumber nextPage; /* next overflow page */ BlockNumber prevPage; /* prev overflow or bucket page */ + /* scan direction for the saved position's call to _hash_next */ + ScanDirection dir; + /* * The items array is always ordered in index order (ie, increasing * indexoffset).
When scanning backwards it is convenient to fill the @@ -123,6 +126,7 @@ typedef struct HashScanPosData int firstItem; /* first valid index in items[] */ int lastItem; /* last valid index in items[] */ int itemIndex; /* current index in items[] */ + int streamIndex; /* position of the read stream */ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */ } HashScanPosData; @@ -150,6 +154,7 @@ typedef struct HashScanPosData (scanpos).firstItem = 0; \ (scanpos).lastItem = 0; \ (scanpos).itemIndex = 0; \ + (scanpos).dir = NoMovementScanDirection; \ } while (0) /* @@ -182,6 +187,9 @@ typedef struct HashScanOpaqueData int *killedItems; /* currPos.items indexes of killed items */ int numKilled; /* number of currently stored items */ + /* last block returned by the read_next stream callback */ + BlockNumber lastBlock; + /* * Identify all the matching items on a page and save them in * HashScanPosData -- 2.50.0
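A side note on the read_next callbacks: the hash callback above (and the GiST/SP-GiST ones in the following patches) all follow the same skeleton - walk the saved items, skip runs of duplicate blocks using lastBlock, and return InvalidBlockNumber once the page is exhausted. A minimal sketch of what a shared helper might look like; index_stream_next_block() and stream_next_tid_fn are made-up names, not part of the patches:

#include "postgres.h"
#include "access/genam.h"
#include "access/relscan.h"
#include "storage/itemptr.h"

/* AM-specific part: return the next TID, or NULL when the page is exhausted */
typedef ItemPointer (*stream_next_tid_fn) (IndexScanDesc scan);

static BlockNumber
index_stream_next_block(IndexScanDesc scan, stream_next_tid_fn next_tid,
                        BlockNumber *lastBlock)
{
    ItemPointer tid;

    while ((tid = next_tid(scan)) != NULL)
    {
        BlockNumber block = ItemPointerGetBlockNumber(tid);

        /* don't return the same block twice (lastBlock stays unchanged) */
        if (block == *lastBlock)
            continue;

        /* remember the block we're returning */
        *lastBlock = block;
        return block;
    }

    /* ran out of items - the stream ends until the caller resets it */
    *lastBlock = InvalidBlockNumber;
    return InvalidBlockNumber;
}

Each AM would then only supply the few lines that advance the stream position in its opaque struct, and the visibility-map check for index-only scans could live in one place too.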
From db429b6dd4a18f3cf5343f44c9f553328e697f13 Mon Sep 17 00:00:00 2001 From: Tomas Vondra <to...@vondra.me> Date: Fri, 25 Apr 2025 13:22:38 +0200 Subject: [PATCH v20250709 4/6] prefetch for gist indexes Implements gist_stream_read_next() and gist_ordered_stream_read_next() callbacks, for different types of scans: * gist_stream_read_next() is for traditional index scans, returning blocks from GISTScanOpaque. * gist_ordered_stream_read_next() is for scans with results ordered by distance, etc. The ordered scans rely on a pairing heap - the items are fed into it, but then are read and returned one by one. That would make prefetching quite useless, so the patch introduces a small queue on top of the pairing heap, and the items are loaded in batches. This is what getNextNearestPrefetch() is responsible for. Note: Right now the batches are always 32 items, which may regress queries with LIMIT clauses, etc. It should start at 1 and gradually increase the batch size, similarly to how the prefetch distance grows (see the sketch after this patch). FIXME The memory management of the batches is almost certainly leaky and needs to be cleaned up. --- src/backend/access/gist/gistget.c | 160 +++++++++++++++++++- src/backend/access/gist/gistscan.c | 225 +++++++++++++++++++++++++++++ src/include/access/gist_private.h | 16 ++ 3 files changed, 400 insertions(+), 1 deletion(-) diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c index 387d9972345..7e292ebb442 100644 --- a/src/backend/access/gist/gistget.c +++ b/src/backend/access/gist/gistget.c @@ -21,7 +21,9 @@ #include "miscadmin.h" #include "pgstat.h" #include "storage/predicate.h" +#include "utils/datum.h" #include "utils/float.h" +#include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -395,6 +397,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, } so->nPageData = so->curPageData = 0; + so->streamPageData = -1; scan->xs_hitup = NULL; /* might point into pageDataCxt */ if (so->pageDataCxt) MemoryContextReset(so->pageDataCxt); @@ -460,6 +463,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, so->pageData[so->nPageData].heapPtr = it->t_tid; so->pageData[so->nPageData].recheck = recheck; so->pageData[so->nPageData].offnum = i; + so->pageData[so->nPageData].allVisible = false; + so->pageData[so->nPageData].allVisibleSet = false; /* * In an index-only scan, also fetch the data from the tuple. The @@ -496,6 +501,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, item->data.heap.heapPtr = it->t_tid; item->data.heap.recheck = recheck; item->data.heap.recheckDistances = recheck_distances; + item->data.heap.allVisible = false; + item->data.heap.allVisibleSet = false; /* * In an index-only scan, also fetch the data from the tuple. @@ -589,6 +596,22 @@ getNextNearest(IndexScanDesc scan) /* in an index-only scan, also return the reconstructed tuple. */ if (scan->xs_want_itup) scan->xs_hitup = item->data.heap.recontup; + + /* + * If this is an index-only scan, determine the VM status, so that + * we can set xs_visible correctly. + */ + if (scan->xs_want_itup && !
item->data.heap.allVisibleSet) + { + item->data.heap.allVisibleSet = true; + item->data.heap.allVisible + = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(&item->data.heap.heapPtr), + &so->vmBuffer); + } + + scan->xs_visible = item->data.heap.allVisible; + res = true; } else @@ -605,6 +628,119 @@ getNextNearest(IndexScanDesc scan) return res; } +/* + * A variant of getNextNearest() that stashes the items into a small buffer, so + * that the prefetching can work (getNextNearest returns items one by one). + * + * XXX Uses a small secondary queue, because getNextNearest() may be modifying + * the regular pageData[] buffer. + */ +static bool +getNextNearestPrefetch(IndexScanDesc scan) +{ + GISTScanOpaque so = (GISTScanOpaque) scan->opaque; + GISTSearchHeapItem *item; + + /* did we use up all items from the queue? */ + if (so->queueItem == so->queueUsed) + { + /* grow the number of items */ + int maxitems = Min(Max(1, 2 * so->queueUsed), 32); + + /* FIXME gradually increase the number of items, instead of using 32 all the time */ + maxitems = 32; + + so->queueItem = 0; + so->queueUsed = 0; + + while (so->queueUsed < maxitems) + { + if (!getNextNearest(scan)) + break; + + item = &so->queueItems[so->queueUsed++]; + + item->recheck = scan->xs_recheck; + item->heapPtr = scan->xs_heaptid; + item->recontup = scan->xs_hitup; + + /* + * FIXME free the memory (for tuples and orderbyvals/orderbynulls), + * it's leaking now. + */ + item->orderbyvals = palloc0(sizeof(Datum) * scan->numberOfOrderBys); + item->orderbynulls = palloc0(sizeof(bool) * scan->numberOfOrderBys); + + /* + * copy the distances - might be float8, which may be byref, so use + * datumCopy, otherwise the values get clobbered by other items + */ + for (int i = 0; i < scan->numberOfOrderBys; i++) + { + int16 typlen; + bool typbyval; + + /* don't copy NULL values */ + if (scan->xs_orderbynulls[i]) + continue; + + get_typlenbyval(so->orderByTypes[i], &typlen, &typbyval); + + item->orderbyvals[i] = datumCopy(scan->xs_orderbyvals[i], + typbyval, typlen); + } + + memcpy(item->orderbynulls, + scan->xs_orderbynulls, + sizeof(bool) * scan->numberOfOrderBys); + + /* reset, so that we don't free it accidentally */ + scan->xs_hitup = NULL; + } + + /* found no new items, we're done */ + if (so->queueUsed == 0) + return false; + + /* restart the stream for the new queue */ + so->queueStream = -1; + so->lastBlock = InvalidBlockNumber; /* XXX needed? */ + read_stream_reset(scan->xs_rs); + } + + /* next item to return */ + item = &so->queueItems[so->queueItem++]; + + scan->xs_heaptid = item->heapPtr; + scan->xs_recheck = item->recheck; + + /* here it's fine to copy the datums directly (even byref pointers) */ + memcpy(scan->xs_orderbyvals, + item->orderbyvals, + sizeof(Datum) * scan->numberOfOrderBys); + + memcpy(scan->xs_orderbynulls, + item->orderbynulls, + sizeof(bool) * scan->numberOfOrderBys); + + /* in an index-only scan, also return the reconstructed tuple. */ + if (scan->xs_want_itup) + scan->xs_hitup = item->recontup; + + if (scan->xs_want_itup && !
item->allVisibleSet) + { + item->allVisibleSet = true; + item->allVisible + = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(&item->heapPtr), + &so->vmBuffer); + } + + scan->xs_visible = item->allVisible; + + return true; +} + /* * gistgettuple() -- Get the next tuple in the scan */ @@ -642,7 +778,10 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir) if (scan->numberOfOrderBys > 0) { /* Must fetch tuples in strict distance order */ - return getNextNearest(scan); + if (scan->xs_rs) + return getNextNearestPrefetch(scan); + else + return getNextNearest(scan); } else { @@ -677,6 +816,18 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir) if (scan->xs_want_itup) scan->xs_hitup = so->pageData[so->curPageData].recontup; + /* determine VM status, if not done already */ + if (scan->xs_want_itup && !so->pageData[so->curPageData].allVisibleSet) + { + so->pageData[so->curPageData].allVisibleSet = true; + so->pageData[so->curPageData].allVisible + = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(&scan->xs_heaptid), + &so->vmBuffer); + } + + scan->xs_visible = so->pageData[so->curPageData].allVisible; + so->curPageData++; return true; @@ -734,6 +885,13 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir) pfree(item); } while (so->nPageData == 0); + + if (scan->xs_rs) + { + so->streamPageData = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } } } } diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c index d8ba7f7eff5..df05f282aa1 100644 --- a/src/backend/access/gist/gistscan.c +++ b/src/backend/access/gist/gistscan.c @@ -17,6 +17,8 @@ #include "access/gist_private.h" #include "access/gistscan.h" #include "access/relscan.h" +#include "access/table.h" +#include "optimizer/cost.h" #include "utils/float.h" #include "utils/lsyscache.h" #include "utils/memutils.h" @@ -70,6 +72,176 @@ pairingheap_GISTSearchItem_cmp(const pairingheap_node *a, const pairingheap_node * Index AM API functions for scanning GiST indexes */ +/* + * gist_stream_read_next + * Return the next block to read from the read stream. + * + * Returns the next block from the current leaf page. The first block is + * requested when accessing the first tuple, after we've already received + * the TID from the index (for the item curPageData points at). + * + * With index-only scans this skips all-visible pages. The visibility info + * is stored, so that we can later pass it to the scan (we must not access + * the VM again - the bit might have changed, and the read stream would get + * out of sync, i.e. we'd get different blocks than we expect). + * + * Returns the block number to get from the read stream. InvalidBlockNumber + * means we've run out of items on the current leaf page - the stream will + * end, and we'll need to reset it after reading the next page (or after + * changing the scan direction). + * + * XXX Should skip duplicate blocks even when not adjacent (e.g. for + * correlated indexes). Right now only runs of the same block are skipped, + * thanks to lastBlock. + */ +static BlockNumber +gist_stream_read_next(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + GISTScanOpaque so = (GISTScanOpaque) scan->opaque; + BlockNumber block = InvalidBlockNumber; + + /* + * Is this the first request for the read stream (possibly after a reset)? + * If yes, initialize the stream to the current item (curPageData).
+ */ + if (so->streamPageData == (OffsetNumber) -1) + so->streamPageData = (so->curPageData - 1); + + /* + * Find the next block to read. For plain index scans we will return the + * very next item, but with index-only scans we skip TIDs from all-visible + * pages (because we won't read those). + */ + while (so->streamPageData < so->nPageData) + { + ItemPointer tid; + GISTSearchHeapItem *item; + + item = &so->pageData[so->streamPageData]; + + tid = &item->heapPtr; + block = ItemPointerGetBlockNumber(tid); + + /* + * For index-only scans, check the VM and remember the result. If the page + * is all-visible, don't return the block number, try reading the next one. + * + * XXX Maybe this could use the same logic to check for duplicate blocks, + * and reuse the VM result if possible. + */ + if (scan->xs_want_itup) + { + if (!item->allVisibleSet) + { + item->allVisibleSet = true; + item->allVisible = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(tid), + &so->vmBuffer); + } + + /* don't prefetch this all-visible block, try the next one */ + if (item->allVisible) + block = InvalidBlockNumber; + } + + /* advance to the next item, assuming the current scan direction */ + so->streamPageData++; + + /* don't return the same block twice (and remember this one) */ + if (so->lastBlock == block) + block = InvalidBlockNumber; + + /* Did we find a valid block? If yes, we're done. */ + if (block != InvalidBlockNumber) + break; + } + + /* remember the block we're returning */ + so->lastBlock = block; + + return block; +} + +/* + * gist_ordered_stream_read_next + * Return the next block to read from the read stream. + * + * A variant of gist_stream_read_next for ordered scans, returning items from + * a small secondary queue. + */ +static BlockNumber +gist_ordered_stream_read_next(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + GISTScanOpaque so = (GISTScanOpaque) scan->opaque; + BlockNumber block = InvalidBlockNumber; + + /* + * Is this the first request for the read stream (possibly after a reset)? + * If yes, initialize the stream to the current item (queueItem). + */ + if (so->queueStream == -1) + so->queueStream = (so->queueItem - 1); + + /* + * Find the next block to read. For plain index scans we will return the + * very next item, but with index-only scans we skip TIDs from all-visible + * pages (because we won't read those). + */ + while (so->queueStream < so->queueUsed) + { + ItemPointer tid; + GISTSearchHeapItem *item; + + item = &so->queueItems[so->queueStream]; + + tid = &item->heapPtr; + block = ItemPointerGetBlockNumber(tid); + + /* + * For index-only scans, check the VM and remember the result. If the page + * is all-visible, don't return the block number, try reading the next one. + * + * XXX Maybe this could use the same logic to check for duplicate blocks, + * and reuse the VM result if possible. + */ + if (scan->xs_want_itup) + { + if (!item->allVisibleSet) + { + item->allVisibleSet = true; + item->allVisible = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(tid), + &so->vmBuffer); + } + + /* don't prefetch this all-visible block, try the next one */ + if (item->allVisible) + block = InvalidBlockNumber; + } + + /* advance to the next item, assuming the current scan direction */ + so->queueStream++; + + /* don't return the same block twice (and remember this one) */ + if (so->lastBlock == block) + block = InvalidBlockNumber; + + /* Did we find a valid block?
If yes, we're done. */ + if (block != InvalidBlockNumber) + break; + } + + /* remember the block we're returning */ + so->lastBlock = block; + + return block; +} + IndexScanDesc gistbeginscan(Relation heap, Relation index, int nkeys, int norderbys) { @@ -110,6 +282,14 @@ gistbeginscan(Relation heap, Relation index, int nkeys, int norderbys) so->numKilled = 0; so->curBlkno = InvalidBlockNumber; so->curPageLSN = InvalidXLogRecPtr; + so->vmBuffer = InvalidBuffer; + + /* initialize small prefetch queue */ + so->queueUsed = 0; + so->queueItem = 0; + + /* nothing returned yet */ + so->lastBlock = InvalidBlockNumber; scan->opaque = so; @@ -120,6 +300,31 @@ gistbeginscan(Relation heap, Relation index, int nkeys, int norderbys) MemoryContextSwitchTo(oldCxt); + /* + * Initialize the read stream to opt in to prefetching. + * + * XXX See the comments in btbeginscan(). + */ + if (enable_indexscan_prefetch && heap) + { + if (scan->numberOfOrderBys == 0) + scan->xs_rs = read_stream_begin_relation(READ_STREAM_DEFAULT, + NULL, + heap, + MAIN_FORKNUM, + gist_stream_read_next, + scan, + 0); + else + scan->xs_rs = read_stream_begin_relation(READ_STREAM_DEFAULT, + NULL, + heap, + MAIN_FORKNUM, + gist_ordered_stream_read_next, + scan, + 0); + } + return scan; } @@ -341,6 +546,15 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys, /* any previous xs_hitup will have been pfree'd in context resets above */ scan->xs_hitup = NULL; + + /* reset the stream, to start over */ + if (scan->xs_rs) + { + so->streamPageData = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + so->queueItem = so->queueUsed = so->queueStream = 0; + } } void @@ -348,9 +562,20 @@ gistendscan(IndexScanDesc scan) { GISTScanOpaque so = (GISTScanOpaque) scan->opaque; + /* needs to happen before freeGISTstate */ + if (so->vmBuffer != InvalidBuffer) + { + ReleaseBuffer(so->vmBuffer); + so->vmBuffer = InvalidBuffer; + } + /* * freeGISTstate is enough to clean up everything made by gistbeginscan, * as well as the queueCxt if there is a separate context for it.
*/ freeGISTstate(so->giststate); + + /* terminate read stream */ + if (scan->xs_rs) + read_stream_end(scan->xs_rs); } diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h index 39404ec7cdb..924bbae22e2 100644 --- a/src/include/access/gist_private.h +++ b/src/include/access/gist_private.h @@ -17,6 +17,7 @@ #include "access/amapi.h" #include "access/gist.h" #include "access/itup.h" +#include "access/visibilitymap.h" #include "lib/pairingheap.h" #include "storage/bufmgr.h" #include "storage/buffile.h" @@ -124,6 +125,10 @@ typedef struct GISTSearchHeapItem * index-only scans */ OffsetNumber offnum; /* track offset in page to mark tuple as * LP_DEAD */ + bool allVisible; /* all-visible bit from the VM */ + bool allVisibleSet; /* is allVisible initialized? */ + Datum *orderbyvals; /* copied distances, for the prefetch queue */ + bool *orderbynulls; /* NULL flags for orderbyvals */ } GISTSearchHeapItem; /* Unvisited item, either index page or heap tuple */ @@ -169,13 +174,24 @@ typedef struct GISTScanOpaqueData int numKilled; /* number of currently stored items */ BlockNumber curBlkno; /* current number of block */ GistNSN curPageLSN; /* pos in the WAL stream when page was read */ + Buffer vmBuffer; /* buffer for VM lookups in index-only scans */ /* In a non-ordered search, returnable heap items are stored here: */ GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)]; OffsetNumber nPageData; /* number of valid items in array */ OffsetNumber curPageData; /* next item to return */ + OffsetNumber streamPageData; /* next item to queue */ MemoryContext pageDataCxt; /* context holding the fetched tuples, for * index-only scans */ + + /* last block returned by the read_next stream callback */ + BlockNumber lastBlock; + + /* queue to allow prefetching with ordered scans */ + GISTSearchHeapItem queueItems[32]; + int queueItem; /* next item to return from the queue */ + int queueStream; /* next item to pass to the read stream */ + int queueUsed; /* number of valid items in the queue */ } GISTScanOpaqueData; typedef GISTScanOpaqueData *GISTScanOpaque; -- 2.50.0
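On the fixed batch size: the refill in getNextNearestPrefetch() already computes the doubling formula before overriding it with 32. A minimal sketch of the gradual growth, assuming a hypothetical queueMaxItems field in GISTScanOpaqueData that persists across refills (so a short refill near the end of the heap doesn't shrink the next batch):

#include "postgres.h"
#include "access/gist_private.h"

/* sketch only - queueMaxItems is a hypothetical field, not in the patch */
static inline int
gist_queue_batch_size(GISTScanOpaque so)
{
    /* grow 1 -> 2 -> 4 -> ... -> 32, doubling on each queue refill */
    so->queueMaxItems = Min(Max(1, 2 * so->queueMaxItems), 32);

    return so->queueMaxItems;
}

That keeps the first batch at a single item, so a LIMIT 1 query pays no extra cost, while larger scans quickly ramp up to the full batch size.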
From 19ff92d7034e726c840c83946d87c980a48a594d Mon Sep 17 00:00:00 2001 From: Tomas Vondra <to...@vondra.me> Date: Fri, 25 Apr 2025 14:52:56 +0200 Subject: [PATCH v20250709 5/6] prefetch for spgist indexes Implements the spg_stream_read_next() callback, returning blocks from SpGistScanOpaque. Similar to GiST, this handles both regular and ordered scans, but with just a single read_next callback. Note: Right now the batches are always 32 items, which may regress queries with LIMIT clauses, etc. It should start at 1 and gradually increase the batch size, similarly to how the prefetch distance grows (see the sketch after the GiST patch). XXX I wonder if GiST could be simplified to use a single callback too, or if SP-GiST is buggy and needs to use two callbacks. --- src/backend/access/spgist/spgscan.c | 187 +++++++++++++++++++++++++++- src/include/access/spgist_private.h | 11 ++ 2 files changed, 195 insertions(+), 3 deletions(-) diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c index 655f5cdc1eb..c90703a522e 100644 --- a/src/backend/access/spgist/spgscan.c +++ b/src/backend/access/spgist/spgscan.c @@ -18,6 +18,7 @@ #include "access/genam.h" #include "access/relscan.h" #include "access/spgist_private.h" +#include "access/table.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h" @@ -300,6 +301,95 @@ spgPrepareScanKeys(IndexScanDesc scan) } } +/* + * spg_stream_read_next + * Return the next block to read from the read stream. + * + * Returns the next block from the current leaf page. The first block is + * requested when accessing the first tuple, after we've already received + * the TID from the index (for the item iPtr points at). + * + * With index-only scans this skips all-visible pages. The visibility info + * is stored, so that we can later pass it to the scan (we must not access + * the VM again - the bit might have changed, and the read stream would get + * out of sync, i.e. we'd get different blocks than we expect). + * + * Returns the block number to get from the read stream. InvalidBlockNumber + * means we've run out of items on the current leaf page - the stream will + * end, and we'll need to reset it after reading the next page (or after + * changing the scan direction). + * + * XXX Should skip duplicate blocks even when not adjacent (e.g. for + * correlated indexes). Right now only runs of the same block are skipped, + * thanks to lastBlock. + */ +static BlockNumber +spg_stream_read_next(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque; + BlockNumber block = InvalidBlockNumber; + + /* + * Is this the first request for the read stream (possibly after a reset)? + * If yes, initialize the stream to the current item (iPtr). + */ + if (so->sPtr == -1) + so->sPtr = (so->iPtr - 1); + + /* + * Find the next block to read. For plain index scans we will return the + * very next item, but with index-only scans we skip TIDs from all-visible + * pages (because we won't read those). + */ + while (so->sPtr < so->nPtrs) + { + ItemPointer tid; + + tid = &so->heapPtrs[so->sPtr]; + block = ItemPointerGetBlockNumber(tid); + + /* + * For index-only scans, check the VM and remember the result. If the page + * is all-visible, don't return the block number, try reading the next one. + * + * XXX Maybe this could use the same logic to check for duplicate blocks, + * and reuse the VM result if possible.
+ */ + if (scan->xs_want_itup) + { + if (!so->allVisibleSet[so->sPtr]) + { + so->allVisibleSet[so->sPtr] = true; + so->allVisible[so->sPtr] = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(tid), + &so->vmBuffer); + } + + /* don't prefetch this all-visible block, try the next one */ + if (so->allVisible[so->sPtr]) + block = InvalidBlockNumber; + } + + /* advance to the next item (forward scans only) */ + so->sPtr++; + + /* don't return the same block twice (and remember this one) */ + if (so->lastBlock == block) + block = InvalidBlockNumber; + + /* Did we find a valid block? If yes, we're done. */ + if (block != InvalidBlockNumber) + break; + } + + /* remember the block we're returning */ + so->lastBlock = block; + + return block; +} + IndexScanDesc spgbeginscan(Relation heap, Relation index, int keysz, int orderbysz) { @@ -371,8 +461,30 @@ spgbeginscan(Relation heap, Relation index, int keysz, int orderbysz) so->indexCollation = index->rd_indcollation[0]; + /* access to VM for IOS scans (in read_next callback) */ + so->vmBuffer = InvalidBuffer; + + /* nothing returned yet */ + so->lastBlock = InvalidBlockNumber; + scan->opaque = so; + /* + * Initialize the read stream to opt in to prefetching. + * + * XXX See comments in btbeginscan(). + */ + if (enable_indexscan_prefetch && heap) + { + scan->xs_rs = read_stream_begin_relation(READ_STREAM_DEFAULT, + NULL, + heap, + MAIN_FORKNUM, + spg_stream_read_next, + scan, + 0); + } + return scan; } @@ -423,6 +535,14 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, pgstat_count_index_scan(scan->indexRelation); if (scan->instrument) scan->instrument->nsearches++; + + /* reset the stream, so that rescan starts from scratch */ + if (scan->xs_rs) + { + so->sPtr = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } } void @@ -453,6 +573,15 @@ spgendscan(IndexScanDesc scan) pfree(scan->xs_orderbynulls); } + if (so->vmBuffer != InvalidBuffer) + { + ReleaseBuffer(so->vmBuffer); + so->vmBuffer = InvalidBuffer; + } + + if (scan->xs_rs) + read_stream_end(scan->xs_rs); + pfree(so); } @@ -818,9 +947,25 @@ spgWalk(Relation index, SpGistScanOpaque so, bool scanWholeIndex, storeRes_func storeRes) { Buffer buffer = InvalidBuffer; - bool reportedSome = false; + int reportedCount = 0; - while (scanWholeIndex || !reportedSome) + /* + * XXX Read at least 32 items into the queue, to make prefetching work. + * + * XXX We should gradually increase the number of tuples to load, not read + * 32 tuples from the very beginning, similar to how we increase the + * prefetch distance. That might be harmful for queries with a LIMIT clause. + * + * XXX Not sure this is quite safe. The arrays are sized to fit at + * least MaxIndexTuplesPerPage items, but what if there's a page with 31 + * items, and then another page with MaxIndexTuplesPerPage? Then we might + * overflow the arrays (in the while loop below), I think. + * + * XXX I wonder if this is actually needed. Maybe it's needed only for + * ordered scans, when we get the items from the pairing heap one by one. + * So maybe we should do this buffering only in that case?
+ */ + while (scanWholeIndex || (reportedCount < 32)) { SpGistSearchItem *item = spgGetNextQueueItem(so); @@ -838,7 +983,7 @@ redirect: storeRes(so, &item->heapPtr, item->value, item->isNull, item->leafTuple, item->recheck, item->recheckDistances, item->distances); - reportedSome = true; + reportedCount++; } else { @@ -872,23 +1017,33 @@ redirect: if (SpGistBlockIsRoot(blkno)) { + bool reportedSome = false; + /* When root is a leaf, examine all its tuples */ for (offset = FirstOffsetNumber; offset <= max; offset++) (void) spgTestLeafTuple(so, item, page, offset, isnull, true, &reportedSome, storeRes); + + if (reportedSome) + reportedCount++; } else { /* Normal case: just examine the chain we arrived at */ while (offset != InvalidOffsetNumber) { + bool reportedSome = false; + Assert(offset >= FirstOffsetNumber && offset <= max); offset = spgTestLeafTuple(so, item, page, offset, isnull, false, &reportedSome, storeRes); if (offset == SpGistRedirectOffsetNumber) goto redirect; + + if (reportedSome) + reportedCount++; } } } @@ -1042,6 +1197,18 @@ spggettuple(IndexScanDesc scan, ScanDirection dir) scan->xs_recheck = so->recheck[so->iPtr]; scan->xs_hitup = so->reconTups[so->iPtr]; + /* determine and store the VM status, if not done already */ + if (scan->xs_want_itup && !so->allVisibleSet[so->iPtr]) + { + so->allVisibleSet[so->iPtr] = true; + so->allVisible[so->iPtr] + = VM_ALL_VISIBLE(scan->heapRelation, + ItemPointerGetBlockNumber(&so->heapPtrs[so->iPtr]), + &so->vmBuffer); + } + + scan->xs_visible = so->allVisible[so->iPtr]; + if (so->numberOfOrderBys > 0) index_store_float8_orderby_distances(scan, so->orderByTypes, so->distances[so->iPtr], @@ -1074,6 +1241,20 @@ spggettuple(IndexScanDesc scan, ScanDirection dir) if (so->nPtrs == 0) break; /* must have completed scan */ + + /* + * loaded a leaf page's worth of tuples, restart the stream + * + * XXX with ordered scans we typically get nPtrs=1, which means the + * prefetch can't really help. Maybe we should queue a couple of + * items and then prefetch those?
+ */ + if (scan->xs_rs) + { + so->sPtr = -1; + so->lastBlock = InvalidBlockNumber; + read_stream_reset(scan->xs_rs); + } } return false; diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h index cb43a278f46..46c50041ee1 100644 --- a/src/include/access/spgist_private.h +++ b/src/include/access/spgist_private.h @@ -16,8 +16,10 @@ #include "access/itup.h" #include "access/spgist.h" +#include "access/visibilitymap.h" #include "catalog/pg_am_d.h" #include "nodes/tidbitmap.h" +#include "optimizer/cost.h" #include "storage/buf.h" #include "utils/geo_decls.h" #include "utils/relcache.h" @@ -226,15 +228,24 @@ typedef struct SpGistScanOpaqueData TupleDesc reconTupDesc; /* if so, descriptor for reconstructed tuples */ int nPtrs; /* number of TIDs found on current page */ int iPtr; /* index for scanning through same */ + int sPtr; /* index for scanning through same (for stream) */ ItemPointerData heapPtrs[MaxIndexTuplesPerPage]; /* TIDs from cur page */ bool recheck[MaxIndexTuplesPerPage]; /* their recheck flags */ bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck * flags */ HeapTuple reconTups[MaxIndexTuplesPerPage]; /* reconstructed tuples */ + /* for IOS */ + bool allVisible[MaxIndexTuplesPerPage]; + bool allVisibleSet[MaxIndexTuplesPerPage]; + Buffer vmBuffer; + /* distances (for recheck) */ IndexOrderByDistance *distances[MaxIndexTuplesPerPage]; + /* last block returned by the read_next stream callback */ + BlockNumber lastBlock; + /* * Note: using MaxIndexTuplesPerPage above is a bit hokey since * SpGistLeafTuples aren't exactly IndexTuples; however, they are larger, -- 2.50.0
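For context, the consumer side of all these callbacks (which presumably lives in the earlier patches of the series, not shown here) is the plain read stream API: the stream hands back pinned buffers in exactly the order the callback returned block numbers. That ordering contract is also why the callbacks must cache the VM result - if the consumer re-checked the VM and got a different answer, it would expect a different sequence of buffers than the stream delivers. A minimal sketch of the consuming loop:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"

/* drain a stream; buffers arrive in the order the callback returned blocks */
static void
consume_stream(ReadStream *stream)
{
    Buffer      buf;

    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... fetch the matching tuples from this buffer ... */
        ReleaseBuffer(buf);
    }
}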
From 34a2e9e21ac6635cf893c9dc69603e50ac5cf284 Mon Sep 17 00:00:00 2001 From: Tomas Vondra <to...@vondra.me> Date: Fri, 25 Apr 2025 16:07:14 +0200 Subject: [PATCH v20250709 6/6] add prefetch info to explain Show whether prefetch is enabled, and the average prefetch distance. (An example of the output follows after the patch.) --- src/backend/access/gist/gistget.c | 9 +++++ src/backend/access/gist/gistscan.c | 2 ++ src/backend/access/hash/hash.c | 12 +++++++ src/backend/access/nbtree/nbtree.c | 12 +++++++ src/backend/access/spgist/spgscan.c | 14 ++++++++ src/backend/commands/explain.c | 56 +++++++++++++++++++++++++++++ src/include/access/relscan.h | 5 +++ 7 files changed, 110 insertions(+) diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c index 7e292ebb442..c5b4ec94616 100644 --- a/src/backend/access/gist/gistget.c +++ b/src/backend/access/gist/gistget.c @@ -830,6 +830,15 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir) so->curPageData++; + if (scan->xs_rs) + { + if (so->streamPageData != (OffsetNumber) -1) + { + scan->xs_rs_count++; + scan->xs_rs_distance += abs(so->curPageData - so->streamPageData); + } + } + return true; } diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c index df05f282aa1..3fe630a87c0 100644 --- a/src/backend/access/gist/gistscan.c +++ b/src/backend/access/gist/gistscan.c @@ -323,6 +323,8 @@ gistbeginscan(Relation heap, Relation index, int nkeys, int norderbys) gist_ordered_stream_read_next, scan, 0); + scan->xs_rs_count = 0; + scan->xs_rs_distance = 0; } return scan; diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 0884f0e05d9..65b4bba0682 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -333,6 +333,15 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir) res = _hash_next(scan, dir); } + if (scan->xs_rs) + { + if (so->currPos.streamIndex != -1) + { + scan->xs_rs_count++; + scan->xs_rs_distance += (so->currPos.streamIndex - so->currPos.itemIndex); + } + } + return res; } @@ -485,6 +494,9 @@ hashbeginscan(Relation heap, Relation index, int nkeys, int norderbys) hash_stream_read_next, scan, 0); + + scan->xs_rs_count = 0; + scan->xs_rs_distance = 0; } return scan; diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 0fa4af79dac..d7242405803 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -280,6 +280,15 @@ btgettuple(IndexScanDesc scan, ScanDirection dir) /* ...
otherwise see if we need another primitive index scan */ } while (so->numArrayKeys && _bt_start_prim_scan(scan, dir)); + if (scan->xs_rs) + { + if (so->currPos.streamIndex != -1) + { + scan->xs_rs_count++; + scan->xs_rs_distance += (so->currPos.streamIndex - so->currPos.itemIndex); + } + } + return res; } @@ -503,6 +512,9 @@ btbeginscan(Relation heap, Relation index, int nkeys, int norderbys) bt_stream_read_next, scan, 0); + + scan->xs_rs_count = 0; + scan->xs_rs_distance = 0; } return scan; diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c index c90703a522e..9c87d364391 100644 --- a/src/backend/access/spgist/spgscan.c +++ b/src/backend/access/spgist/spgscan.c @@ -483,6 +483,9 @@ spgbeginscan(Relation heap, Relation index, int keysz, int orderbysz) spg_stream_read_next, scan, 0); + + scan->xs_rs_count = 0; + scan->xs_rs_distance = 0; } return scan; @@ -1214,6 +1217,17 @@ spggettuple(IndexScanDesc scan, ScanDirection dir) so->distances[so->iPtr], so->recheckDistances[so->iPtr]); so->iPtr++; + + if (scan->xs_rs) + { + if (so->sPtr != -1) + { + scan->xs_rs_count++; + scan->xs_rs_distance += abs(so->iPtr - so->sPtr); + } + } + return true; } diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 7e2792ead71..d978d79e0f6 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -14,6 +14,7 @@ #include "postgres.h" #include "access/xact.h" +#include "access/relscan.h" #include "catalog/pg_type.h" #include "commands/createas.h" #include "commands/defrem.h" @@ -136,6 +137,7 @@ static void show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es); static void show_hashagg_info(AggState *aggstate, ExplainState *es); static void show_indexsearches_info(PlanState *planstate, ExplainState *es); +static void show_index_prefetch_info(PlanState *planstate, ExplainState *es); static void show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es); static void show_instrumentation_count(const char *qlabel, int which, @@ -1966,6 +1968,7 @@ ExplainNode(PlanState *planstate, List *ancestors, show_instrumentation_count("Rows Removed by Filter", 1, planstate, es); show_indexsearches_info(planstate, es); + show_index_prefetch_info(planstate, es); break; case T_IndexOnlyScan: show_scan_qual(((IndexOnlyScan *) plan)->indexqual, @@ -1983,6 +1986,7 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyFloat("Heap Fetches", NULL, planstate->instrument->ntuples2, 0, es); show_indexsearches_info(planstate, es); + show_index_prefetch_info(planstate, es); break; case T_BitmapIndexScan: show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig, @@ -3889,6 +3893,58 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es) ExplainPropertyUInteger("Index Searches", NULL, nsearches, es); } + +static void +show_index_prefetch_info(PlanState *planstate, ExplainState *es) +{ + Plan *plan = planstate->plan; + bool prefetch = false; + float distance = 0; + + if (!es->analyze) + return; + + if (!es->verbose) + return; + + /* Collect the stats from this node's scan descriptor */ + switch (nodeTag(plan)) + { + case T_IndexScan: + { + IndexScanState *indexstate = ((IndexScanState *) planstate); + + if (indexstate->iss_ScanDesc) + { + prefetch = (indexstate->iss_ScanDesc->xs_rs != NULL); + + if (indexstate->iss_ScanDesc->xs_rs) + distance = (indexstate->iss_ScanDesc->xs_rs_distance / (float) Max(1, indexstate->iss_ScanDesc->xs_rs_count)); + } + break; + } + case T_IndexOnlyScan: + {
IndexOnlyScanState *indexstate = ((IndexOnlyScanState *) planstate); + + if (indexstate->ioss_ScanDesc) + { + prefetch = (indexstate->ioss_ScanDesc->xs_rs != NULL); + + if (indexstate->ioss_ScanDesc->xs_rs) + distance = (indexstate->ioss_ScanDesc->xs_rs_distance / (float) Max(1, indexstate->ioss_ScanDesc->xs_rs_count)); + } + break; + } + default: + break; + } + + /* Display the prefetch flag and the average prefetch distance */ + ExplainPropertyBool("Index Prefetch", prefetch, es); + ExplainPropertyFloat("Index Distance", NULL, distance, 1, es); +} + /* * Show exact/lossy pages for a BitmapHeapScan node */ diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index 56e6c6245e5..ab524bd9b18 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -176,8 +176,13 @@ typedef struct IndexScanDescData bool xs_heap_continue; /* T if must keep walking, potential * further results */ IndexFetchTableData *xs_heapfetch; + ReadStream *xs_rs; /* read_stream (if supported by the AM) */ + /* stats for explain */ + int xs_rs_distance; /* sum of stream-vs-scan position deltas */ + int xs_rs_count; /* number of deltas accumulated */ + bool xs_recheck; /* T means scan keys must be rechecked */ /* -- 2.50.0
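To close, an illustration of what the last patch adds to EXPLAIN (ANALYZE, VERBOSE) for index scans and index-only scans. The values below are made up, but the shape follows from the ExplainPropertyBool/ExplainPropertyFloat calls above:

   Index Scan using t_a_idx on public.t (actual time=0.042..9.345 rows=10000 loops=1)
     Index Cond: (a < 10000)
     Index Searches: 1
     Index Prefetch: true
     Index Distance: 6.4

"Index Distance" is the running average of how far the read stream position (streamIndex, sPtr, ...) got ahead of the item the scan was returning, so it directly shows how much lookahead the prefetching actually achieved.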