Hi,

Here's a somewhat reworked version of the patch. My initial goal was to
see if it could adopt the StreamingRead API proposed in [1], but that
turned out to be less straightforward than I hoped, for two reasons:

(1) The StreamingRead API seems to be designed for pages, but the index
code naturally works with TIDs/tuples. Yes, the callbacks can associate
the blocks with custom data (in this case that'd be the TID), but it
seemed a bit strange ...

(2) The place adding requests to the StreamingRead queue is pretty far
from the place actually reading the pages - for prefetching, the
requests would be generated in nodeIndexscan, but the page reading
happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple.
Sure, the TIDs would come from a callback, so it's a bit as if the
requests were generated in heapam_index_fetch_tuple - but that function
has no idea StreamingRead exists, so where would it get it from?

We might teach it about StreamingRead, but what if there are multiple
places calling index_fetch_heap(), not all of which use StreamingRead
(only indexscans would do that)? Or if there are multiple index scans,
there'd need to be a separate StreamingRead queue for each, right?

In any case, I felt a bit out of my depth here, and I chose not to do
all this work without first discussing the direction. (Also, see the
point about cursors and xs_heap_continue a bit later in this post.)


I did however like the general StreamingRead API - how it splits the
work between the API and the callback. The patch used to do everything,
which meant it hardcoded a lot of the IOS-specific logic etc. I did plan
to have some sort of "callback" for reading from the queue, but that
didn't quite solve this issue - a lot of the stuff remained hard-coded.
But the StreamingRead API made me realize that having a callback for the
first phase (that adds requests to the queue) would fix that.

So I did that - there's now one simple callback for index scans, and
a bit more complex callback for index-only scans. Thanks to this the
hard-coded stuff mostly disappears, which is good.
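
To make the division of labor concrete, here is roughly the shape of
the callback I have in mind - the typedef name and signature below are
just my shorthand for this message, not copied verbatim from the
attached patch:

/*
 * Illustrative only - the name/signature is not the exact one from the
 * patch. The callback produces the next entry to stash into the prefetch
 * queue, or NULL once the index scan is exhausted. The index scan variant
 * simply reads the next TID from the index; the index-only scan variant
 * additionally does the VM check and sets all_visible, so that all-visible
 * pages are not prefetched.
 */
typedef IndexPrefetchEntry *(*IndexPrefetchNextCB) (IndexScanDesc scan,
                                                    ScanDirection direction,
                                                    void *arg);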

Perhaps a bigger change is that I decided to move this into a separate
API on top of indexam.c. The original idea was to integrate this into
index_getnext_tid/index_getnext_slot, so that all callers benefit from
the prefetching automatically. That would be nice, but it also meant it
would need to happen in the indexam.c code, which seemed dirty.

This patch introduces an API similar to StreamingRead. It calls the
indexam.c stuff, but does all the prefetching on top of it, not in it.
If a place calling index_getnext_tid() wants to allow prefetching, it
needs to switch to IndexPrefetchNext(). (There's no function that would
replace index_getnext_slot, at the moment. Maybe there should be.)
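
For illustration, here's a minimal sketch of what the switch might look
like for a caller - I'm guessing the IndexPrefetchNext() signature and
assuming it sets scan->xs_heaptid the way index_getnext_tid() does, so
treat this as a rough shape rather than the exact API:

/* before: reading TIDs straight from indexam.c (xs_heap_continue
 * handling omitted for brevity) */
ItemPointer tid;

while ((tid = index_getnext_tid(scan, direction)) != NULL)
{
	if (index_fetch_heap(scan, slot))
	{
		/* process the heap tuple in slot */
	}
}

/* after: reading TIDs through the prefetch layer, which keeps its queue
 * topped up and issues PrefetchBuffer() for the queued heap blocks */
IndexPrefetchEntry *entry;

while ((entry = IndexPrefetchNext(scan, prefetch, direction)) != NULL)
{
	if (index_fetch_heap(scan, slot))
	{
		/* process the heap tuple in slot */
	}
}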

Note 1: The IndexPrefetch name is a bit misleading, because it's used
even with prefetching disabled - all index reads from the index scan
happen through it. Maybe it should be called IndexReader or something
like that.

Note 2: I left the code in indexam.c for now, but in principle it could
(should) be moved to a different place.

I think this layering makes sense, and it's probably much closer to what
Andres meant when he said the prefetching should happen in the executor.
Even if the patch ends up using StreamingRead in the future, I guess
we'll want something like IndexPrefetch - it might use StreamingRead
internally, but it would still need to do some custom work to detect
I/O patterns and other things that don't quite fit into StreamingRead.


Now, let's talk about two (mostly unrelated) problems I ran into.

Firstly, I realized there's a bit of a problem with cursors. The
prefetching works like this:

1) reading TIDs from the index
2) stashing them into a queue in IndexPrefetch
3) doing prefetches for the new TIDs added to the queue
4) returning the TIDs to the caller, one by one
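
In code, the consumer side boils down to roughly this (condensed from
the first attached patch - the ramp-up, stats and IOS-specific bits are
left out):

/* steps 1-3: top up the queue and prefetch each newly added TID */
while (!PREFETCH_FULL(prefetch))
{
	bool		all_visible;
	ItemPointer	tid = index_getnext_tid_internal(scan, direction);

	if (tid == NULL)
	{
		prefetch->prefetchDone = true;
		break;
	}

	index_prefetch(scan, prefetch, tid, false, &all_visible);

	prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].tid = *tid;
	prefetch->queueEnd++;
}

/* step 4: return the next queued TID to the caller */
scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
prefetch->queueIndex++;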

And all of this works ... unless the direction of the scan changes.
Which for cursors can happen if someone does FETCH BACKWARD or stuff
like that. I'm not sure how difficult it'd be to make this work. I
suppose we could simply discard the prefetched entries and do the right
number of steps back for the index scan. But I haven't tried, and maybe
it's more complex than I'm imagining. Also, if the cursor changes the
direction a lot, it'd make the prefetching harmful.
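
Just to illustrate the idea (I have not tried this - the function and
the stepping logic below are hypothetical, not something from the
patch):

/*
 * Hypothetical sketch: on a direction change, forget the TIDs we read
 * ahead but have not returned yet, and step the underlying index scan
 * back over them, so that the next fetch in the new direction returns
 * the entry the caller actually expects.
 */
static void
IndexPrefetchChangeDirection(IndexScanDesc scan, IndexPrefetch *prefetch,
							 ScanDirection newdir)
{
	/* TIDs read from the index but not yet returned to the caller */
	uint64		lookahead = prefetch->queueEnd - prefetch->queueIndex;

	/* discard the prefetched-but-unreturned entries */
	prefetch->queueStart = prefetch->queueIndex;
	prefetch->queueEnd = prefetch->queueIndex;

	/* step the index scan back over the discarded entries */
	for (uint64 i = 0; i < lookahead; i++)
		(void) index_getnext_tid_internal(scan, newdir);

	/* restart the ramp-up, the access pattern has likely changed */
	prefetch->prefetchTarget = 0;
}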

The patch simply disables prefetching for such queries, using the same
logic that we do for parallelism. This may be over-zealous.

FWIW this is one of the things that probably should remain outside of
the StreamingRead API - it seems pretty index-specific, and I'm not sure
we'd even want to support these "backward" movements in the API.


The other issue I'm aware of is handling xs_heap_continue. I believe it
works fine when it's "false", but I need to take a look at non-MVCC
snapshots (i.e. when xs_heap_continue=true).


I haven't done any benchmarks with this reworked API - there are a
couple more allocations etc., but nothing changed in a fundamental way.
I don't expect any major difference.

regards



[1]
https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 74bd0d6b70fa8ca3a1b26196de6b7a9cc670ac9b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Fri, 17 Nov 2023 23:54:19 +0100
Subject: [PATCH v20240103 1/2] prefetch 2023-12-09

Patch version shared on 2023/12/09.
---
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/index/genam.c         |   4 +-
 src/backend/access/index/indexam.c       | 551 ++++++++++++++++++++++-
 src/backend/commands/explain.c           |  18 +
 src/backend/executor/execIndexing.c      |   6 +-
 src/backend/executor/execReplication.c   |   9 +-
 src/backend/executor/instrument.c        |   4 +
 src/backend/executor/nodeIndexonlyscan.c |  99 +++-
 src/backend/executor/nodeIndexscan.c     |  71 ++-
 src/backend/utils/adt/selfuncs.c         |   2 +-
 src/include/access/genam.h               | 115 ++++-
 src/include/executor/instrument.h        |   2 +
 src/include/nodes/execnodes.h            |   4 +
 13 files changed, 868 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7c28dafb728..26d3ec20b63 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -792,7 +792,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
+			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot, NULL))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4ca12006843..72e7c9f206c 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -509,7 +509,7 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
+		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot, NULL))
 		{
 			bool		shouldFree;
 
@@ -713,7 +713,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
+	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot, NULL))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index f23e0199f08..f96aeba1b39 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -49,16 +49,19 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/visibilitymap.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "catalog/pg_amproc.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
+#include "common/hashfn.h"
 #include "nodes/makefuncs.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
@@ -108,6 +111,13 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  int nkeys, int norderbys, Snapshot snapshot,
 											  ParallelIndexScanDesc pscan, bool temp_snap);
 
+static void index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
+								IndexPrefetch *prefetch);
+static ItemPointer index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
+										  IndexPrefetch *prefetch, bool *all_visible);
+static void index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
+						   ItemPointer tid, bool skip_all_visible, bool *all_visible);
+
 
 /* ----------------------------------------------------------------
  *				   index_ interface functions
@@ -536,8 +546,8 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
  * or NULL if no more matching tuples exist.
  * ----------------
  */
-ItemPointer
-index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
+static ItemPointer
+index_getnext_tid_internal(IndexScanDesc scan, ScanDirection direction)
 {
 	bool		found;
 
@@ -636,16 +646,21 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
  * ----------------
  */
 bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
+index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot,
+				   IndexPrefetch *prefetch)
 {
 	for (;;)
 	{
+		/* Do prefetching (if requested/enabled). */
+		index_prefetch_tids(scan, direction, prefetch);
+
 		if (!scan->xs_heap_continue)
 		{
-			ItemPointer tid;
+			ItemPointer	tid;
+			bool		all_visible;
 
 			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
+			tid = index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
 
 			/* If we're out of index entries, we're done */
 			if (tid == NULL)
@@ -1003,3 +1018,529 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
 
 	return build_local_reloptions(&relopts, attoptions, validate);
 }
+
+/*
+ * index_prefetch_is_sequential
+ *		Track the block number and check if the I/O pattern is sequential,
+ *		or if the same block was just prefetched.
+ *
+ * Prefetching is cheap, but for some access patterns the benefits are small
+ * compared to the extra overhead. In particular, for sequential access the
+ * read-ahead performed by the OS is very effective/efficient. Doing more
+ * prefetching is just increasing the costs.
+ *
+ * This tries to identify simple sequential patterns, so that we can skip
+ * the prefetching request. This is implemented by having a small queue
+ * of block numbers, and checking it before prefetching another block.
+ *
+ * We look at the preceding PREFETCH_SEQ_PATTERN_BLOCKS blocks, and see if
+ * they are sequential. We also check if the block is the same as the last
+ * request (which is not sequential).
+ *
+ * Note that the main prefetch queue is not really useful for this, as it
+ * stores TIDs while we care about block numbers. Consider a sorted table,
+ * with a perfectly sequential pattern when accessed through an index. Each
+ * heap page may have dozens of TIDs, but we need to check block numbers.
+ * We could keep enough TIDs to cover enough blocks, but then we also need
+ * to walk those when checking the pattern (in hot path).
+ *
+ * So instead, we maintain a small separate queue of block numbers, and we use
+ * this instead.
+ *
+ * Returns true if the block is in a sequential pattern (and so should not be
+ * prefetched), or false (not sequential, should be prefetched).
+ *
+ * XXX The name is a bit misleading, as it also adds the block number to the
+ * block queue and checks if the block is the same as the last one (which
+ * does not require a sequential pattern).
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch *prefetch, BlockNumber block)
+{
+	int			idx;
+
+	/*
+	 * If the block queue is empty, just store the block and we're done (it's
+	 * neither a sequential pattern nor a recently prefetched block).
+	 */
+	if (prefetch->blockIndex == 0)
+	{
+		prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+		prefetch->blockIndex++;
+		return false;
+	}
+
+	/*
+	 * Check if it's the same as the immediately preceding block. We don't
+	 * want to prefetch the same block over and over (which would happen for
+	 * well correlated indexes).
+	 *
+	 * In principle we could rely on index_prefetch_add_cache doing this using
+	 * the full cache, but this check is much cheaper and we need to look at
+	 * the preceding block anyway, so we just do it.
+	 *
+	 * XXX Notice we haven't added the block to the block queue yet, and there
+	 * is a preceding block (i.e. blockIndex-1 is valid).
+	 */
+	if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+		return true;
+
+	/*
+	 * Add the block number to the queue.
+	 *
+	 * We do this before checking the pattern, because we want to know
+	 * about the block even if we end up skipping the prefetch. Otherwise we'd
+	 * not be able to detect longer sequential patterns - we'd skip one block
+	 * but then fail to skip the next couple blocks even in a perfect
+	 * sequential pattern. This oscillation might even prevent the OS
+	 * read-ahead from kicking in.
+	 */
+	prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+	prefetch->blockIndex++;
+
+	/*
+	 * Check if the last couple blocks are in a sequential pattern. We look
+	 * for a sequential pattern of PREFETCH_SEQ_PATTERN_BLOCKS (4 by default),
+	 * so we look for patterns of 5 pages (40kB) including the new block.
+	 *
+	 * XXX Perhaps this should be tied to effective_io_concurrency somehow?
+	 *
+	 * XXX Could it be harmful that we read the queue backwards? Maybe memory
+	 * prefetching works better for the forward direction?
+	 */
+	for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+	{
+		/*
+		 * Are there enough requests to confirm a sequential pattern? We only
+		 * consider something to be sequential after finding a sequence of
+		 * PREFETCH_SEQ_PATTERN_BLOCKS blocks.
+		 *
+		 * FIXME Better to move this outside the loop.
+		 */
+		if (prefetch->blockIndex < i)
+			return false;
+
+		/*
+		 * Calculate index of the earlier block (we need to do -1 as we
+		 * already incremented the index when adding the new block to the
+		 * queue).
+		 */
+		idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+		/*
+		 * For a sequential pattern, the block "k" steps back needs to have a
+		 * block number smaller by "k" than the current block.
+		 */
+		if (prefetch->blockItems[idx] != (block - i))
+			return false;
+	}
+
+	return true;
+}
+
+/*
+ * index_prefetch_add_cache
+ *		Add a block to the cache, check if it was recently prefetched.
+ *
+ * We don't want to prefetch blocks that we already prefetched recently. It's
+ * cheap but not free, and the overhead may have measurable impact.
+ *
+ * This check needs to be very cheap, even with fairly large caches (hundreds
+ * of entries, see PREFETCH_CACHE_SIZE).
+ *
+ * A simple queue would allow expiring the requests, but checking if it
+ * contains a particular block prefetched would be expensive (linear search).
+ * Another option would be a simple hash table, which has fast lookup but
+ * does not allow expiring entries cheaply.
+ *
+ * The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also need
+ * to expire entries, so that only "recent" requests are remembered.
+ *
+ * We use a hybrid cache that is organized as many small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table). The LRU caches are tiny (e.g. 8 entries), and the expiration
+ * happens at the level of a single LRU (by tracking only the 8 most recent requests).
+ *
+ * This allows quick searches and expiration, but with false negatives (when a
+ * particular LRU has too many collisions, we may evict entries that are more
+ * recent than some other LRU).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch requests in total (these are the default parameters).
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (128 zebibytes, which would take ~5 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search if the tiny LRU has an entry for the same block,
+ * requested less than PREFETCH_CACHE_SIZE requests ago.
+ *
+ * At the same time, we either update the entry (for the queried block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ *
+ * Returns true if the block was recently prefetched (and thus we don't
+ * need to prefetch it again), or false (should do a prefetch).
+ *
+ * XXX It's a bit confusing that these return values are inverted compared to
+ * what index_prefetch_is_sequential does.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch *prefetch, BlockNumber block)
+{
+	IndexPrefetchCacheEntry *entry;
+
+	/* map the block number to the LRU */
+	int			lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+	/* age/index of the oldest entry in the LRU, to maybe use */
+	uint64		oldestRequest = PG_UINT64_MAX;
+	int			oldestIndex = -1;
+
+	/*
+	 * First add the block to the (tiny) top-level LRU cache and see if it's
+	 * part of a sequential pattern. In this case we just ignore the block and
+	 * don't prefetch it - we expect read-ahead to do a better job.
+	 *
+	 * XXX Maybe we should still add the block to the hybrid cache, in case we
+	 * happen to access it later? That might help if we first scan a lot of
+	 * the table sequentially, and then randomly. Not sure that's very likely
+	 * with index access, though.
+	 */
+	if (index_prefetch_is_sequential(prefetch, block))
+	{
+		prefetch->countSkipSequential++;
+		return true;
+	}
+
+	/*
+	 * See if we recently prefetched this block - we simply scan the LRU
+	 * linearly. While doing that, we also track the oldest entry, so that we
+	 * know where to put the block if we don't find a matching entry.
+	 */
+	for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+	{
+		entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+		/* Is this the oldest prefetch request in this LRU? */
+		if (entry->request < oldestRequest)
+		{
+			oldestRequest = entry->request;
+			oldestIndex = i;
+		}
+
+		/*
+		 * If the entry is unused (identified by request being set to 0),
+		 * skip it. Notice the field is uint64, so an empty entry is
+		 * guaranteed to be the oldest one.
+		 */
+		if (entry->request == 0)
+			continue;
+
+		/* Is this entry for the same block as the current request? */
+		if (entry->block == block)
+		{
+			bool		prefetched;
+
+			/*
+			 * Is the old request sufficiently recent? If yes, we treat the
+			 * block as already prefetched.
+			 *
+			 * XXX We do add the cache size to the request in order not to
+			 * have issues with uint64 underflows.
+			 */
+			prefetched = ((entry->request + PREFETCH_CACHE_SIZE) >= prefetch->prefetchReqNumber);
+
+			/* Update the request number. */
+			entry->request = ++prefetch->prefetchReqNumber;
+
+			prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+			return prefetched;
+		}
+	}
+
+	/*
+	 * We didn't find the block in the LRU, so store it either in an empty
+	 * entry, or in the "oldest" prefetch request in this LRU.
+	 */
+	Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+	/* FIXME do a nice macro */
+	entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+	entry->block = block;
+	entry->request = ++prefetch->prefetchReqNumber;
+
+	/* not in the prefetch cache */
+	return false;
+}
+
+/*
+ * index_prefetch
+ *		Prefetch the TID, unless it's sequential or recently prefetched.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of rescans and number of
+ * items (TIDs) actually returned from the scan. Then we could calculate
+ * rows / rescan and use that to clamp prefetch target.
+ *
+ * That'd help with cases when a scan matches only very few rows, far less
+ * than the prefetchTarget, because the unnecessary prefetches are wasted
+ * I/O. Imagine a LIMIT on top of index scan, or something like that.
+ *
+ * Another option is to use the planner estimates - we know how many rows we're
+ * expecting to fetch (on average, assuming the estimates are reasonably
+ * accurate), so why not use that?
+ *
+ * Of course, we could/should combine these two approaches.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such a filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems somewhat wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only a few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ *
+ * XXX Could we tune the cache size based on execution statistics? We have
+ * a cache of limited size (PREFETCH_CACHE_SIZE = 1024 by default), but
+ * how do we know it's the right size? Ideally, we'd have a cache large
+ * enough to track actually cached blocks. If the OS caches 10240 pages,
+ * then we may do 90% of prefetch requests unnecessarily. Or maybe there's
+ * a lot of contention, blocks are evicted quickly, and 90% of the blocks
+ * in the cache are not actually cached anymore? But we do have a concept
+ * of sequential request ID (PrefetchCacheEntry->request), which gives us
+ * information about "age" of the last prefetch. Now it's used only when
+ * evicting entries (to keep the more recent one), but maybe we could also
+ * use it when deciding if the page is cached. Right now any block that's
+ * in the cache is considered cached and not prefetched, but maybe we could
+ * have "max age", and tune it based on feedback from reading the blocks
+ * later. For example, if we find the block in cache and decide not to
+ * prefetch it, but then later find we have to do I/O, it means our cache
+ * is too large. And we could "reduce" the maximum age (measured from the
+ * current prefetchReqNumber value), so that only more recent blocks would
+ * be considered cached. Not sure about the opposite direction, where we
+ * decide to prefetch a block - AFAIK we don't have a way to determine if
+ * I/O was needed or not in this case (so we can't increase the max age).
+ * But maybe we could do that somehow speculatively, i.e. increase the
+ * value once in a while, and see what happens.
+ */
+static void
+index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
+			   ItemPointer tid, bool skip_all_visible, bool *all_visible)
+{
+	BlockNumber block;
+
+	/* by default not all visible (or we didn't check) */
+	*all_visible = false;
+
+	/*
+	 * No heap relation means bitmap index scan, which does prefetching at the
+	 * bitmap heap scan, so no prefetch here (we can't do it anyway, without
+	 * the heap)
+	 *
+	 * XXX But in this case we should have prefetchMaxTarget=0, because in
+	 * index_beginscan_bitmap() we disable prefetching. So maybe we should
+	 * just check that.
+	 */
+	if (!prefetch)
+		return;
+
+	/*
+	 * If we got here, prefetching is enabled and it's a node that supports
+	 * prefetching (i.e. it can't be a bitmap index scan).
+	 */
+	Assert(scan->heapRelation);
+
+	block = ItemPointerGetBlockNumber(tid);
+
+	/*
+	 * When prefetching for IOS, we want to only prefetch pages that are not
+	 * marked as all-visible (because not fetching all-visible pages is the
+	 * point of IOS).
+	 *
+	 * XXX This is not great, because it releases the VM buffer for each TID
+	 * we consider to prefetch. We should reuse that somehow, similar to the
+	 * actual IOS code. Ideally, we should use the same ioss_VMBuffer (if
+	 * we can propagate it here). Or at least do it for a bulk of prefetches,
+	 * although that's not very useful - after the ramp-up we will prefetch
+	 * the pages one by one anyway.
+	 *
+	 * XXX Ideally we'd also propagate this to the executor, so that the
+	 * nodeIndexonlyscan.c doesn't need to repeat the same VM check (which
+	 * is measurable). But the index_getnext_tid() is not really well
+	 * suited for that, so the API needs a change.
+	 */
+	if (skip_all_visible)
+	{
+		*all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+									  block,
+									  &prefetch->vmBuffer);
+
+		if (*all_visible)
+			return;
+	}
+
+	/*
+	 * Do not prefetch the same block over and over again,
+	 *
+	 * This happens e.g. for clustered or naturally correlated indexes (fkey
+	 * to a sequence ID). It's not expensive (the block is in page cache
+	 * already, so no I/O), but it's not free either.
+	 */
+	if (!index_prefetch_add_cache(prefetch, block))
+	{
+		prefetch->countPrefetch++;
+
+		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+		pgBufferUsage.blks_prefetches++;
+	}
+
+	prefetch->countAll++;
+}
+
+/* ----------------
+ * index_getnext_tid - get the next TID from a scan
+ *
+ * The result is the next TID satisfying the scan keys,
+ * or NULL if no more matching tuples exist.
+ *
+ * FIXME not sure this handles xs_heapfetch correctly.
+ * ----------------
+ */
+ItemPointer
+index_getnext_tid(IndexScanDesc scan, ScanDirection direction,
+				  IndexPrefetch *prefetch)
+{
+	bool		all_visible;	/* ignored */
+
+	/* Do prefetching (if requested/enabled). */
+	index_prefetch_tids(scan, direction, prefetch);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
+}
+
+ItemPointer
+index_getnext_tid_vm(IndexScanDesc scan, ScanDirection direction,
+					 IndexPrefetch *prefetch, bool *all_visible)
+{
+	/* Do prefetching (if requested/enabled). */
+	index_prefetch_tids(scan, direction, prefetch);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return index_prefetch_get_tid(scan, direction, prefetch, all_visible);
+}
+
+static void
+index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
+					IndexPrefetch *prefetch)
+{
+	/*
+	 * If the prefetching is still active (i.e. enabled and we still
+	 * haven't finished reading TIDs from the scan), read enough TIDs into
+	 * the queue until we hit the current target.
+	 */
+	if (PREFETCH_ACTIVE(prefetch))
+	{
+		/*
+		 * Ramp up the prefetch distance incrementally.
+		 *
+		 * Intentionally done first, before reading the TIDs into the
+		 * queue, so that there's always at least one item. Otherwise we
+		 * might get into a situation where we start with target=0 and no
+		 * TIDs loaded.
+		 */
+		prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+									   prefetch->prefetchMaxTarget);
+
+		/*
+		 * Now read TIDs from the index until the queue is full (with
+		 * respect to the current prefetch target).
+		 */
+		while (!PREFETCH_FULL(prefetch))
+		{
+			ItemPointer tid;
+			bool		all_visible;
+
+			/* Time to fetch the next TID from the index */
+			tid = index_getnext_tid_internal(scan, direction);
+
+			/*
+			 * If we're out of index entries, we're done (and we mark
+			 * the prefetcher as inactive).
+			 */
+			if (tid == NULL)
+			{
+				prefetch->prefetchDone = true;
+				break;
+			}
+
+			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+			/*
+			 * Issue the actual prefetch requests for the new TID.
+			 *
+			 * XXX Prefetching of all-visible pages is skipped only for IOS
+			 * (i.e. when prefetch->indexonly is set).
+			 */
+			index_prefetch(scan, prefetch, tid, prefetch->indexonly, &all_visible);
+
+			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].tid = *tid;
+			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].all_visible = all_visible;
+			prefetch->queueEnd++;
+		}
+	}
+}
+
+static ItemPointer
+index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
+					   IndexPrefetch *prefetch, bool *all_visible)
+{
+	/*
+	 * With prefetching enabled (even if we already finished reading
+	 * all TIDs from the index scan), we need to return a TID from the
+	 * queue. Otherwise, we just get the next TID from the scan
+	 * directly.
+	 */
+	if (PREFETCH_ENABLED(prefetch))
+	{
+		/* Did we reach the end of the scan and the queue is empty? */
+		if (PREFETCH_DONE(prefetch))
+			return NULL;
+
+		scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
+		*all_visible = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].all_visible;
+		prefetch->queueIndex++;
+	}
+	else				/* not prefetching, just do the regular work  */
+	{
+		ItemPointer tid;
+
+		/* Time to fetch the next TID from the index */
+		tid = index_getnext_tid_internal(scan, direction);
+		*all_visible = false;
+
+		/* If we're out of index entries, we're done */
+		if (tid == NULL)
+			return NULL;
+
+		Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+	}
+
+	/* Return the TID of the tuple we found. */
+	return &scan->xs_heaptid;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f1d71bc54e8..6810996edfd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3568,6 +3568,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 										!INSTR_TIME_IS_ZERO(usage->local_blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
 												  has_local || has_temp ||
 												  has_shared_timing ||
@@ -3679,6 +3680,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			appendStringInfoChar(es->str, '\n');
 		}
 
+		/* As above, show only positive counter values. */
+		if (has_prefetches)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+				appendStringInfo(es->str, " blocks=%lld",
+								 (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+				appendStringInfo(es->str, " rounds=%lld",
+								 (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 2fa2118f3c2..0a136db6712 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -777,7 +777,11 @@ retry:
 	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
+	/*
+	 * XXX Would be nice to also benefit from prefetching here. All we need to
+	 * do is instantiate the prefetcher, I guess.
+	 */
+	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot, NULL))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 81f27042bc4..9498b00fa64 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -212,8 +212,13 @@ retry:
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
-	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
+	/*
+	 * Try to find the tuple
+	 *
+	 * XXX Would be nice to also benefit from prefetching here. All we need to
+	 * do is instantiate the prefetcher, I guess.
+	 */
+	while (index_getnext_slot(scan, ForwardScanDirection, outslot, NULL))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c383f34c066..0011d9f679c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->local_blks_written += add->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+	dst->blks_prefetches += add->blks_prefetches;
 	INSTR_TIME_ADD(dst->shared_blk_read_time, add->shared_blk_read_time);
 	INSTR_TIME_ADD(dst->shared_blk_write_time, add->shared_blk_write_time);
 	INSTR_TIME_ADD(dst->local_blk_read_time, add->local_blk_read_time);
@@ -259,6 +261,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+	dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
 						  add->shared_blk_read_time, sub->shared_blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index f1db35665c8..a7eadaf3db2 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -43,7 +43,7 @@
 #include "storage/predicate.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
-
+#include "utils/spccache.h"
 
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
@@ -65,6 +65,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	ItemPointer tid;
+	IndexPrefetch  *prefetch;
+	bool			all_visible;
 
 	/*
 	 * extract necessary information from index scan node
@@ -78,6 +80,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->ioss_ScanDesc;
+	prefetch = node->ioss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -116,7 +119,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while ((tid = index_getnext_tid_vm(scandesc, direction, prefetch, &all_visible)) != NULL)
 	{
 		bool		tuple_from_heap = false;
 
@@ -155,8 +158,11 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 *
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
+		 *
+		 * XXX Skip if we already know the page is all visible from prefetcher.
 		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+		if (!all_visible &&
+			!VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
 		{
@@ -353,6 +359,16 @@ ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 
+	/* also reset the prefetcher, so that we start from scratch */
+	if (node->ioss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->ioss_prefetch;
+
+		prefetch->queueIndex = 0;
+		prefetch->queueStart = 0;
+		prefetch->queueEnd = 0;
+	}
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -380,6 +396,26 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		node->ioss_VMBuffer = InvalidBuffer;
 	}
 
+	/* Release VM buffer pin from prefetcher, if any. */
+	if (node->ioss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->ioss_prefetch;
+
+		/* XXX some debug info */
+		elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
+			 prefetch->countAll,
+			 prefetch->countPrefetch,
+			 prefetch->countPrefetch * 100.0 / prefetch->countAll,
+			 prefetch->countSkipCached,
+			 prefetch->countSkipSequential);
+
+		if (prefetch->vmBuffer != InvalidBuffer)
+		{
+			ReleaseBuffer(prefetch->vmBuffer);
+			prefetch->vmBuffer = InvalidBuffer;
+		}
+	}
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -604,6 +640,63 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 		indexstate->ioss_RuntimeContext = NULL;
 	}
 
+	/*
+	 * Also initialize index prefetcher.
+	 *
+	 * XXX No prefetching for direct I/O.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
+	{
+		int			prefetch_max;
+		Relation    heapRel = indexstate->ss.ss_currentRelation;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index. This is
+		 * essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 *
+		 * XXX Maybe reduce the value with parallel workers?
+		 */
+		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+						   indexstate->ss.ps.plan->plan_rows);
+
+		/*
+		 * We reach here if the index only scan is not parallel, or if we're
+		 * serially executing an index only scan that was planned to be
+		 * parallel.
+		 *
+		 * XXX Maybe we should enable prefetching, but prefetch only pages that
+		 * are not all-visible (but checking that from the index code seems like
+		 * a violation of layering etc).
+		 *
+		 * XXX This might lead to IOS being slower than plain index scan, if the
+		 * table has a lot of pages that need recheck.
+		 *
+		 * Remember that this is an index-only scan - the prefetcher needs to
+		 * know that. Not the most elegant way to pass this info.
+		 */
+		if (prefetch_max > 0)
+		{
+			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+			prefetch->queueIndex = 0;
+			prefetch->queueStart = 0;
+			prefetch->queueEnd = 0;
+
+			prefetch->prefetchTarget = 0;
+			prefetch->prefetchMaxTarget = prefetch_max;
+			prefetch->vmBuffer = InvalidBuffer;
+			prefetch->indexonly = true;
+
+			indexstate->ioss_prefetch = prefetch;
+		}
+	}
+
 	/*
 	 * all done.
 	 */
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 14b9c00217a..b3282ec5a75 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 /*
  * When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	IndexPrefetch  *prefetch;
 
 	/*
 	 * extract necessary information from index scan node
@@ -98,6 +100,7 @@ IndexNext(IndexScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->iss_ScanDesc;
+	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -128,7 +131,7 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (index_getnext_slot(scandesc, direction, slot, prefetch))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -177,6 +180,7 @@ IndexNextWithReorder(IndexScanState *node)
 	Datum	   *lastfetched_vals;
 	bool	   *lastfetched_nulls;
 	int			cmp;
+	IndexPrefetch *prefetch;
 
 	estate = node->ss.ps.state;
 
@@ -193,6 +197,7 @@ IndexNextWithReorder(IndexScanState *node)
 	Assert(ScanDirectionIsForward(estate->es_direction));
 
 	scandesc = node->iss_ScanDesc;
+	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -259,7 +264,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot, prefetch))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -588,6 +593,16 @@ ExecReScanIndexScan(IndexScanState *node)
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 	node->iss_ReachedEnd = false;
 
+	/* also reset the prefetcher, so that we start from scratch */
+	if (node->iss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->iss_prefetch;
+
+		prefetch->queueIndex = 0;
+		prefetch->queueStart = 0;
+		prefetch->queueEnd = 0;
+	}
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -794,6 +809,19 @@ ExecEndIndexScan(IndexScanState *node)
 	indexRelationDesc = node->iss_RelationDesc;
 	indexScanDesc = node->iss_ScanDesc;
 
+	/* XXX nothing to free, but print some debug info */
+	if (node->iss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->iss_prefetch;
+
+		elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
+			 prefetch->countAll,
+			 prefetch->countPrefetch,
+			 prefetch->countPrefetch * 100.0 / prefetch->countAll,
+			 prefetch->countSkipCached,
+			 prefetch->countSkipSequential);
+	}
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -1066,6 +1094,45 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 		indexstate->iss_RuntimeContext = NULL;
 	}
 
+	/*
+	 * Also initialize index prefetcher.
+	 *
+	 * XXX No prefetching for direct I/O.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
+	{
+		int	prefetch_max;
+		Relation    heapRel = indexstate->ss.ss_currentRelation;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index scan. This
+		 * is essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 */
+		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+						   indexstate->ss.ps.plan->plan_rows);
+
+		if (prefetch_max > 0)
+		{
+			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+			prefetch->queueIndex = 0;
+			prefetch->queueStart = 0;
+			prefetch->queueEnd = 0;
+
+			prefetch->prefetchTarget = 0;
+			prefetch->prefetchMaxTarget = prefetch_max;
+			prefetch->vmBuffer = InvalidBuffer;
+
+			indexstate->iss_prefetch = prefetch;
+		}
+	}
+
 	/*
 	 * all done.
 	 */
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e11d022827a..b5c79359425 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6297,7 +6297,7 @@ get_actual_variable_endpoint(Relation heapRel,
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	while ((tid = index_getnext_tid(index_scan, indexscandir, NULL)) != NULL)
 	{
 		BlockNumber block = ItemPointerGetBlockNumber(tid);
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 80dc8d54066..c0c46d7a05f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
 #include "access/sdir.h"
 #include "access/skey.h"
 #include "nodes/tidbitmap.h"
+#include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -128,6 +129,110 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
+
+
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct IndexPrefetchCacheEntry {
+	BlockNumber		block;
+	uint64			request;
+} IndexPrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too
+ * small or too large. 1024 seems about right, it covers ~8MB of data.
+ * It's somewhat arbitrary, there's no particular formula saying it
+ * should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define		PREFETCH_LRU_SIZE		8
+#define		PREFETCH_LRU_COUNT		128
+#define		PREFETCH_CACHE_SIZE		(PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define		PREFETCH_QUEUE_HISTORY			8
+#define		PREFETCH_SEQ_PATTERN_BLOCKS		4
+
+typedef struct IndexPrefetchEntry
+{
+	ItemPointerData		tid;
+	bool				all_visible;
+} IndexPrefetchEntry;
+
+typedef struct IndexPrefetch
+{
+	/*
+	 * XXX We need to disable this in some cases (e.g. when using index-only
+	 * scans, we don't want to prefetch pages). Or maybe we should prefetch
+	 * only pages that are not all-visible, that'd be even better.
+	 */
+	int			prefetchTarget;	/* how far we should be prefetching */
+	int			prefetchMaxTarget;	/* maximum prefetching distance */
+	int			prefetchReset;	/* reset to this distance on rescan */
+	bool		prefetchDone;	/* did we get all TIDs from the index? */
+
+	/* runtime statistics */
+	uint64		countAll;		/* all prefetch requests */
+	uint64		countPrefetch;	/* actual prefetches */
+	uint64		countSkipSequential;
+	uint64		countSkipCached;
+
+	/* used when prefetching index-only scans */
+	bool		indexonly;
+	Buffer		vmBuffer;
+
+	/*
+	 * Queue of TIDs to prefetch.
+	 *
+	 * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+	 * than dynamically adjusting for custom values.
+	 */
+	IndexPrefetchEntry	queueItems[MAX_IO_CONCURRENCY];
+	uint64			queueIndex;	/* next TID to prefetch */
+	uint64			queueStart;	/* first valid TID in queue */
+	uint64			queueEnd;	/* first invalid (empty) TID in queue */
+
+	/*
+	 * The last couple of prefetched blocks, used to check for certain access
+	 * patterns and skip prefetching (e.g. for sequential access).
+	 *
+	 * XXX Separate from the main queue, because we only want to compare the
+	 * block numbers, not the whole TID. In sequential access it's likely we
+	 * read many items from each page, and we don't want to check many items
+	 * (as that is much more expensive).
+	 */
+	BlockNumber		blockItems[PREFETCH_QUEUE_HISTORY];
+	uint64			blockIndex;	/* index in the block queue (points to
+								 * the first empty entry) */
+
+	/*
+	 * Cache of recently prefetched blocks, organized as a hash table of
+	 * small LRU caches.
+	 */
+	uint64				prefetchReqNumber;
+	IndexPrefetchCacheEntry	prefetchCache[PREFETCH_CACHE_SIZE];
+
+} IndexPrefetch;
+
+#define PREFETCH_QUEUE_INDEX(a)	((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p)	((p)->queueEnd == (p)->queueIndex)
+#define PREFETCH_ENABLED(p)		((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_FULL(p)		((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p)		((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p)		(PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+#define PREFETCH_BLOCK_INDEX(v)	((v) % PREFETCH_QUEUE_HISTORY)
+
+
 /*
  * generalized index_ interface routines (in indexam.c)
  */
@@ -173,11 +278,17 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
-									 ScanDirection direction);
+									 ScanDirection direction,
+									 IndexPrefetch *prefetch);
+extern ItemPointer index_getnext_tid_vm(IndexScanDesc scan,
+										ScanDirection direction,
+										IndexPrefetch *prefetch,
+										bool *all_visible);
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   struct TupleTableSlot *slot);
+							   struct TupleTableSlot *slot,
+							   IndexPrefetch *prefetch);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d5d69941c52..f53fb4a1e51 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
+	int64		blks_prefetch_rounds;	/* # of prefetch rounds */
+	int64		blks_prefetches;	/* # of buffers prefetched */
 	instr_time	shared_blk_read_time;	/* time spent reading shared blocks */
 	instr_time	shared_blk_write_time;	/* time spent writing shared blocks */
 	instr_time	local_blk_read_time;	/* time spent reading local blocks */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d7f17dee07..8745453a5b4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1529,6 +1529,7 @@ typedef struct
 	bool	   *elem_nulls;		/* array of num_elems is-null flags */
 } IndexArrayKeyInfo;
 
+
 /* ----------------
  *	 IndexScanState information
  *
@@ -1580,6 +1581,8 @@ typedef struct IndexScanState
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
+
+	IndexPrefetch *iss_prefetch;
 } IndexScanState;
 
 /* ----------------
@@ -1618,6 +1621,7 @@ typedef struct IndexOnlyScanState
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
+	IndexPrefetch *ioss_prefetch;
 } IndexOnlyScanState;
 
 /* ----------------
-- 
2.43.0

From b9021c498bb273055f8cf8809030c4abc7848737 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Mon, 1 Jan 2024 21:50:47 +0100
Subject: [PATCH v20240103 2/2] switch to StreamingRead-like API

---
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/index/README.prefetch |   3 +
 src/backend/access/index/genam.c         |   4 +-
 src/backend/access/index/indexam.c       | 254 ++++++++++++-----------
 src/backend/executor/execIndexing.c      |   6 +-
 src/backend/executor/execReplication.c   |   9 +-
 src/backend/executor/nodeIndexonlyscan.c | 149 +++++++------
 src/backend/executor/nodeIndexscan.c     | 105 ++++++----
 src/backend/optimizer/plan/createplan.c  |  27 ++-
 src/backend/utils/adt/selfuncs.c         |   2 +-
 src/include/access/genam.h               |  68 ++++--
 src/include/nodes/execnodes.h            |   4 +-
 src/include/nodes/plannodes.h            |   2 +
 13 files changed, 382 insertions(+), 253 deletions(-)
 create mode 100644 src/backend/access/index/README.prefetch

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 26d3ec20b63..7c28dafb728 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -792,7 +792,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot, NULL))
+			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
diff --git a/src/backend/access/index/README.prefetch b/src/backend/access/index/README.prefetch
new file mode 100644
index 00000000000..2a6ac4a0eea
--- /dev/null
+++ b/src/backend/access/index/README.prefetch
@@ -0,0 +1,3 @@
+- index heap prefetch overview
+- 
+- callback - decision whether to prefetch, possibility to keep data
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 72e7c9f206c..4ca12006843 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -509,7 +509,7 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot, NULL))
+		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
 		{
 			bool		shouldFree;
 
@@ -713,7 +713,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot, NULL))
+	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index f96aeba1b39..cdad3f4c6f9 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -49,7 +49,6 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/transam.h"
-#include "access/visibilitymap.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "catalog/pg_amproc.h"
@@ -64,6 +63,7 @@
 #include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
+#include "utils/spccache.h"
 #include "utils/syscache.h"
 
 
@@ -111,13 +111,16 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  int nkeys, int norderbys, Snapshot snapshot,
 											  ParallelIndexScanDesc pscan, bool temp_snap);
 
-static void index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
-								IndexPrefetch *prefetch);
-static ItemPointer index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
-										  IndexPrefetch *prefetch, bool *all_visible);
-static void index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
-						   ItemPointer tid, bool skip_all_visible, bool *all_visible);
-
+/* index prefetching of heap pages */
+static void index_prefetch_tids(IndexScanDesc scan,
+								IndexPrefetch *prefetch,
+								ScanDirection direction);
+static IndexPrefetchEntry *index_prefetch_get_entry(IndexScanDesc scan,
+												    IndexPrefetch *prefetch,
+													ScanDirection direction);
+static void index_prefetch_heap_page(IndexScanDesc scan,
+									 IndexPrefetch *prefetch,
+									 IndexPrefetchEntry *entry);
 
 /* ----------------------------------------------------------------
  *				   index_ interface functions
@@ -546,8 +549,8 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
  * or NULL if no more matching tuples exist.
  * ----------------
  */
-static ItemPointer
-index_getnext_tid_internal(IndexScanDesc scan, ScanDirection direction)
+ItemPointer
+index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 {
 	bool		found;
 
@@ -643,24 +646,22 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
  * Note: caller must check scan->xs_recheck, and perform rechecking of the
  * scan keys if required.  We do not do that here because we don't have
  * enough information to do it efficiently in the general case.
+ *
+ * XXX This does not support prefetching of heap pages. When such prefetching is
+ * desirable, use IndexPrefetchNext() together with index_fetch_heap() instead.
  * ----------------
  */
 bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot,
-				   IndexPrefetch *prefetch)
+index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
 {
 	for (;;)
 	{
-		/* Do prefetching (if requested/enabled). */
-		index_prefetch_tids(scan, direction, prefetch);
-
 		if (!scan->xs_heap_continue)
 		{
-			ItemPointer	tid;
-			bool		all_visible;
+			ItemPointer tid;
 
 			/* Time to fetch the next TID from the index */
-			tid = index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
+			tid = index_getnext_tid(scan, direction);
 
 			/* If we're out of index entries, we're done */
 			if (tid == NULL)
@@ -1339,13 +1340,9 @@ index_prefetch_add_cache(IndexPrefetch *prefetch, BlockNumber block)
  * value once in a while, and see what happens.
  */
 static void
-index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
-			   ItemPointer tid, bool skip_all_visible, bool *all_visible)
+index_prefetch_heap_page(IndexScanDesc scan, IndexPrefetch *prefetch, IndexPrefetchEntry *entry)
 {
-	BlockNumber block;
-
-	/* by default not all visible (or we didn't check) */
-	*all_visible = false;
+	BlockNumber block = ItemPointerGetBlockNumber(&entry->tid);
 
 	/*
 	 * No heap relation means bitmap index scan, which does prefetching at the
@@ -1355,6 +1352,8 @@ index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
 	 * XXX But in this case we should have prefetchMaxTarget=0, because in
 	 * index_bebinscan_bitmap() we disable prefetching. So maybe we should
 	 * just check that.
+	 *
+	 * XXX Comment/check seems obsolete.
 	 */
 	if (!prefetch)
 		return;
@@ -1362,37 +1361,10 @@ index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
 	/*
 	 * If we got here, prefetching is enabled and it's a node that supports
 	 * prefetching (i.e. it can't be a bitmap index scan).
-	 */
-	Assert(scan->heapRelation);
-
-	block = ItemPointerGetBlockNumber(tid);
-
-	/*
-	 * When prefetching for IOS, we want to only prefetch pages that are not
-	 * marked as all-visible (because not fetching all-visible pages is the
-	 * point of IOS).
 	 *
-	 * XXX This is not great, because it releases the VM buffer for each TID
-	 * we consider to prefetch. We should reuse that somehow, similar to the
-	 * actual IOS code. Ideally, we should use the same ioss_VMBuffer (if
-	 * we can propagate it here). Or at least do it for a bulk of prefetches,
-	 * although that's not very useful - after the ramp-up we will prefetch
-	 * the pages one by one anyway.
-	 *
-	 * XXX Ideally we'd also propagate this to the executor, so that the
-	 * nodeIndexonlyscan.c doesn't need to repeat the same VM check (which
-	 * is measurable). But the index_getnext_tid() is not really well
-	 * suited for that, so the API needs a change.s
+	 * XXX Comment/check seems obsolete.
 	 */
-	if (skip_all_visible)
-	{
-		*all_visible = VM_ALL_VISIBLE(scan->heapRelation,
-									  block,
-									  &prefetch->vmBuffer);
-
-		if (*all_visible)
-			return;
-	}
+	Assert(scan->heapRelation);
 
 	/*
 	 * Do not prefetch the same block over and over again,
@@ -1412,42 +1384,12 @@ index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
 	prefetch->countAll++;
 }
 
-/* ----------------
- * index_getnext_tid - get the next TID from a scan
- *
- * The result is the next TID satisfying the scan keys,
- * or NULL if no more matching tuples exist.
- *
- * FIXME not sure this handles xs_heapfetch correctly.
- * ----------------
+/*
+ * index_prefetch_tids
+ *		Fill the prefetch queue and issue necessary prefetch requests.
  */
-ItemPointer
-index_getnext_tid(IndexScanDesc scan, ScanDirection direction,
-				  IndexPrefetch *prefetch)
-{
-	bool		all_visible;	/* ignored */
-
-	/* Do prefetching (if requested/enabled). */
-	index_prefetch_tids(scan, direction, prefetch);
-
-	/* Read the TID from the queue (or directly from the index). */
-	return index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
-}
-
-ItemPointer
-index_getnext_tid_vm(IndexScanDesc scan, ScanDirection direction,
-					 IndexPrefetch *prefetch, bool *all_visible)
-{
-	/* Do prefetching (if requested/enabled). */
-	index_prefetch_tids(scan, direction, prefetch);
-
-	/* Read the TID from the queue (or directly from the index). */
-	return index_prefetch_get_tid(scan, direction, prefetch, all_visible);
-}
-
 static void
-index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
-					IndexPrefetch *prefetch)
+index_prefetch_tids(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
 {
 	/*
 	 * If the prefetching is still active (i.e. enabled and we still
@@ -1473,43 +1415,46 @@ index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
 		 */
 		while (!PREFETCH_FULL(prefetch))
 		{
-			ItemPointer tid;
-			bool		all_visible;
+			IndexPrefetchEntry *entry = prefetch->next_cb(scan, prefetch, direction);
 
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid_internal(scan, direction);
-
-			/*
-			 * If we're out of index entries, we're done (and we mark the
-			 * the prefetcher as inactive).
-			 */
-			if (tid == NULL)
+			/* no more entries in this index scan */
+			if (entry == NULL)
 			{
 				prefetch->prefetchDone = true;
-				break;
+				return;
 			}
 
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+			Assert(ItemPointerEquals(&entry->tid, &scan->xs_heaptid));
 
-			/*
-			 * Issue the actuall prefetch requests for the new TID.
-			 *
-			 * XXX index_getnext_tid_prefetch is only called for IOS (for now),
-			 * so skip prefetching of all-visible pages.
-			 */
-			index_prefetch(scan, prefetch, tid, prefetch->indexonly, &all_visible);
+			/* store the entry and then maybe issue the prefetch request */
+			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd++)] = *entry;
 
-			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].tid = *tid;
-			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].all_visible = all_visible;
-			prefetch->queueEnd++;
+			/* issue the prefetch request? */
+			if (entry->prefetch)
+				index_prefetch_heap_page(scan, prefetch, entry);
 		}
 	}
 }
 
-static ItemPointer
-index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
-					   IndexPrefetch *prefetch, bool *all_visible)
+/*
+ * index_prefetch_get_entry
+ *		Get the next entry from the prefetch queue (or from the index directly).
+ *
+ * If prefetching is enabled, get the next entry from the prefetch queue (unless
+ * the queue is empty). With prefetching disabled, read an entry directly from the
+ * index scan.
+ *
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this? Although, that
+ * should be in the indexscan next_cb callback, probably.
+ *
+ * XXX If xs_heap_continue=true, we need to return the last TID.
+ */
+static IndexPrefetchEntry *
+index_prefetch_get_entry(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
 {
+	IndexPrefetchEntry *entry = NULL;
+
 	/*
 	 * With prefetching enabled (even if we already finished reading
 	 * all TIDs from the index scan), we need to return a TID from the
@@ -1522,25 +1467,98 @@ index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
 		if (PREFETCH_DONE(prefetch))
 			return NULL;
 
-		scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
-		*all_visible = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].all_visible;
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
+		entry->data = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].data;
+
 		prefetch->queueIndex++;
+
+		scan->xs_heaptid = entry->tid;
 	}
 	else				/* not prefetching, just do the regular work  */
 	{
 		ItemPointer tid;
 
 		/* Time to fetch the next TID from the index */
-		tid = index_getnext_tid_internal(scan, direction);
-		*all_visible = false;
+		tid = index_getnext_tid(scan, direction);
 
 		/* If we're out of index entries, we're done */
 		if (tid == NULL)
 			return NULL;
 
 		Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = scan->xs_heaptid;
+		entry->data = NULL;
 	}
 
-	/* Return the TID of the tuple we found. */
-	return &scan->xs_heaptid;
+	return entry;
+}
+
+int
+index_heap_prefetch_target(Relation heapRel, double plan_rows, bool allow_prefetch)
+{
+	/*
+	 * XXX No prefetching for direct I/O.
+	 *
+	 * XXX Shouldn't we do prefetching even for direct I/O? We would only pretend
+	 * to do it for now, of course, because we would not do posix_fadvise(), but
+	 * once the code starts loading into shared buffers, that would work.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+		return 0;
+
+	/* disable prefetching for cursors etc. */
+	if (!allow_prefetch)
+		return 0;
+
+	/*
+	 * Determine number of heap pages to prefetch for this index. This is
+	 * essentially just effective_io_concurrency for the table (or the
+	 * tablespace it's in).
+	 *
+	 * XXX Should this also look at plan.plan_rows and maybe cap the target
+	 * to that? Pointless to prefetch more than we expect to use. Or maybe
+	 * just reset to that value during prefetching, after reading the next
+	 * index page (or rather after rescan)?
+	 *
+	 * XXX Maybe reduce the value with parallel workers?
+	 */
+	return Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+			   plan_rows);
+}
+
+IndexPrefetch *
+IndexPrefetchAlloc(IndexPrefetchNextCB next_cb, int prefetch_max, void *data)
+{
+	IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+	prefetch->queueIndex = 0;
+	prefetch->queueStart = 0;
+	prefetch->queueEnd = 0;
+
+	prefetch->prefetchTarget = 0;
+	prefetch->prefetchMaxTarget = prefetch_max;
+
+	/*
+	 * Store the callback (and its private data) that customizes the prefetch,
+	 * e.g. the IOS callback checks the visibility map and keeps the result.
+	 */
+	prefetch->next_cb = next_cb;
+	prefetch->data = data;
+
+	return prefetch;
+}
+
+IndexPrefetchEntry *
+IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	/* Do prefetching (if requested/enabled). */
+	index_prefetch_tids(scan, prefetch, direction);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return index_prefetch_get_entry(scan, prefetch, direction);
 }
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0a136db6712..2fa2118f3c2 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -777,11 +777,7 @@ retry:
 	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	/*
-	 * XXX Would be nice to also benefit from prefetching here. All we need to
-	 * do is instantiate the prefetcher, I guess.
-	 */
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot, NULL))
+	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9498b00fa64..81f27042bc4 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -212,13 +212,8 @@ retry:
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
-	/*
-	 * Try to find the tuple
-	 *
-	 * XXX Would be nice to also benefit from prefetching here. All we need to
-	 * do is instantiate the prefetcher, I guess.
-	 */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot, NULL))
+	/* Try to find the tuple */
+	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index a7eadaf3db2..af7dd364f33 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -43,12 +43,13 @@
 #include "storage/predicate.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
-#include "utils/spccache.h"
 
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
 							TupleDesc itupdesc);
-
+static IndexPrefetchEntry *IndexOnlyPrefetchNext(IndexScanDesc scan,
+												 IndexPrefetch *prefetch,
+												 ScanDirection direction);
 
 /* ----------------------------------------------------------------
  *		IndexOnlyNext
@@ -66,7 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	TupleTableSlot *slot;
 	ItemPointer tid;
 	IndexPrefetch  *prefetch;
-	bool			all_visible;
+	IndexPrefetchEntry *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -76,6 +77,12 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * Determine which direction to scan the index in based on the plan's scan
 	 * direction and the current direction of execution.
+	 *
+	 * XXX Could this be an issue for the prefetching? What if we prefetch something
+	 * but the direction changes before we get to the read? If that could happen,
+	 * maybe we should discard the prefetched data and go back? But can we even
+	 * do that, if we already fetched some TIDs from the index? I don't think
+	 * indexorderdir can change, but es_direction maybe can?
 	 */
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir);
@@ -119,10 +126,15 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid_vm(scandesc, direction, prefetch, &all_visible)) != NULL)
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
+		bool	   *all_visible = NULL;
 		bool		tuple_from_heap = false;
 
+		/* unpack the entry */
+		tid = &entry->tid;
+		all_visible = (bool *) entry->data;	/* result of visibility check */
+
 		CHECK_FOR_INTERRUPTS();
 
 		/*
@@ -161,7 +173,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 *
 		 * XXX Skip if we already know the page is all visible from prefetcher.
 		 */
-		if (!all_visible &&
+		if (!(all_visible && *all_visible) &&
 			!VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
@@ -367,6 +379,9 @@ ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
 		prefetch->queueIndex = 0;
 		prefetch->queueStart = 0;
 		prefetch->queueEnd = 0;
+
+		prefetch->prefetchDone = false;
+		prefetch->prefetchTarget = 0;
 	}
 
 	ExecScanReScan(&node->ss);
@@ -401,6 +416,8 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	{
 		IndexPrefetch *prefetch = node->ioss_prefetch;
 
+		Buffer *buffer = (Buffer *) prefetch->data;
+
 		/* XXX some debug info */
 		elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
 			 prefetch->countAll,
@@ -409,10 +426,10 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 			 prefetch->countSkipCached,
 			 prefetch->countSkipSequential);
 
-		if (prefetch->vmBuffer != InvalidBuffer)
+		if (*buffer != InvalidBuffer)
 		{
-			ReleaseBuffer(prefetch->vmBuffer);
-			prefetch->vmBuffer = InvalidBuffer;
+			ReleaseBuffer(*buffer);
+			*buffer = InvalidBuffer;
 		}
 	}
 
@@ -512,6 +529,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	Relation	currentRelation;
 	LOCKMODE	lockmode;
 	TupleDesc	tupDesc;
+	int			prefetch_max;
 
 	/*
 	 * create state structure
@@ -641,61 +659,33 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Also initialize index prefetcher.
+	 * Also initialize index prefetcher. We do this even when prefetching is
+	 * not done (see index_heap_prefetch_target), because the
+	 * prefetcher is used for all index reads.
+	 *
+	 * We reach here if the index only scan is not parallel, or if we're
+	 * serially executing an index only scan that was planned to be
+	 * parallel.
 	 *
-	 * XXX No prefetching for direct I/O.
+	 * For IOS we prefetch only the heap pages that are not all-visible -
+	 * skipping the heap fetch for all-visible pages is the whole point of
+	 * IOS. The visibility map check happens in the IndexOnlyPrefetchNext
+	 * callback, which also remembers the result so that the main loop does
+	 * not have to repeat it.
+	 *
+	 * XXX This might still make IOS slower than a plain index scan, if the
+	 * table has a lot of pages that are not all-visible and thus need the
+	 * heap fetch anyway.
+	 *
+	 * XXX Maybe rename the object to "index reader" or something?
 	 */
-	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
-	{
-		int			prefetch_max;
-		Relation    heapRel = indexstate->ss.ss_currentRelation;
-
-		/*
-		 * Determine number of heap pages to prefetch for this index. This is
-		 * essentially just effective_io_concurrency for the table (or the
-		 * tablespace it's in).
-		 *
-		 * XXX Should this also look at plan.plan_rows and maybe cap the target
-		 * to that? Pointless to prefetch more than we expect to use. Or maybe
-		 * just reset to that value during prefetching, after reading the next
-		 * index page (or rather after rescan)?
-		 *
-		 * XXX Maybe reduce the value with parallel workers?
-		 */
-		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
-						   indexstate->ss.ps.plan->plan_rows);
-
-		/*
-		 * We reach here if the index only scan is not parallel, or if we're
-		 * serially executing an index only scan that was planned to be
-		 * parallel.
-		 *
-		 * XXX Maybe we should enable prefetching, but prefetch only pages that
-		 * are not all-visible (but checking that from the index code seems like
-		 * a violation of layering etc).
-		 *
-		 * XXX This might lead to IOS being slower than plain index scan, if the
-		 * table has a lot of pages that need recheck.
-		 *
-		 * Remember this is index-only scan, because of prefetching. Not the most
-		 * elegant way to pass this info.
-		 */
-		if (prefetch_max > 0)
-		{
-			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+	prefetch_max = index_heap_prefetch_target(indexstate->ss.ss_currentRelation,
+											  indexstate->ss.ps.plan->plan_rows,
+											  node->allow_prefetch);
 
-			prefetch->queueIndex = 0;
-			prefetch->queueStart = 0;
-			prefetch->queueEnd = 0;
-
-			prefetch->prefetchTarget = 0;
-			prefetch->prefetchMaxTarget = prefetch_max;
-			prefetch->vmBuffer = InvalidBuffer;
-			prefetch->indexonly = true;
-
-			indexstate->ioss_prefetch = prefetch;
-		}
-	}
+	indexstate->ioss_prefetch = IndexPrefetchAlloc(IndexOnlyPrefetchNext,
+												   prefetch_max,
+												   palloc0(sizeof(Buffer)));
 
 	/*
 	 * all done.
@@ -808,3 +798,42 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 }
+
+/*
+ * When prefetching for IOS, we want to only prefetch pages that are not
+ * marked as all-visible (because not fetching all-visible pages is the
+ * point of IOS).
+ *
+ * The buffer used by the VM_ALL_VISIBLE() check is reused, similarly to
+ * ioss_VMBuffer (maybe we could/should use it here too?). We also keep
+ * the result of the visibility check (the all_visible flag), so that the
+ * main loop does not need to repeat it.
+ */
+static IndexPrefetchEntry *
+IndexOnlyPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer			tid;
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		BlockNumber	blkno = ItemPointerGetBlockNumber(tid);
+
+		bool	all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+											 blkno,
+											 (Buffer *) prefetch->data);
+
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch only if not all visible */
+		entry->prefetch = !all_visible;
+
+		/* store the all_visible flag in the private part of the entry */
+		entry->data = palloc(sizeof(bool));
+		*(bool *) entry->data = all_visible;
+	}
+
+	return entry;
+}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index b3282ec5a75..bd65337270c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,7 +43,6 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
-#include "utils/spccache.h"
 
 /*
  * When an ordering operator is used, tuples fetched from the index that
@@ -70,6 +69,9 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
 							  Datum *orderbyvals, bool *orderbynulls);
 static HeapTuple reorderqueue_pop(IndexScanState *node);
 
+static IndexPrefetchEntry *IndexScanPrefetchNext(IndexScanDesc scan,
+												 IndexPrefetch *prefetch,
+												 ScanDirection direction);
 
 /* ----------------------------------------------------------------
  *		IndexNext
@@ -87,6 +89,7 @@ IndexNext(IndexScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	IndexPrefetch  *prefetch;
+	IndexPrefetchEntry  *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -131,10 +134,19 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot, prefetch))
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scandesc->xs_heaptid));
+		if (!index_fetch_heap(scandesc, slot))
+			continue;
+
 		/*
 		 * If the index was lossy, we have to recheck the index quals using
 		 * the fetched tuple.
@@ -180,7 +192,6 @@ IndexNextWithReorder(IndexScanState *node)
 	Datum	   *lastfetched_vals;
 	bool	   *lastfetched_nulls;
 	int			cmp;
-	IndexPrefetch *prefetch;
 
 	estate = node->ss.ps.state;
 
@@ -197,7 +208,6 @@ IndexNextWithReorder(IndexScanState *node)
 	Assert(ScanDirectionIsForward(estate->es_direction));
 
 	scandesc = node->iss_ScanDesc;
-	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -264,7 +274,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot, prefetch))
+		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -601,6 +611,9 @@ ExecReScanIndexScan(IndexScanState *node)
 		prefetch->queueIndex = 0;
 		prefetch->queueStart = 0;
 		prefetch->queueEnd = 0;
+
+		prefetch->prefetchDone = false;
+		prefetch->prefetchTarget = 0;
 	}
 
 	ExecScanReScan(&node->ss);
@@ -917,6 +930,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	IndexScanState *indexstate;
 	Relation	currentRelation;
 	LOCKMODE	lockmode;
+	int			prefetch_max;
 
 	/*
 	 * create state structure
@@ -1095,43 +1109,33 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Also initialize index prefetcher.
+	 * Also initialize index prefetcher. We do this even when prefetching is
+	 * not done (see index_heap_prefetch_target), because the
+	 * prefetcher is used for all index reads.
+	 *
+	 * We reach here if the index scan is not parallel, or if we're
+	 * serially executing an index scan that was planned to be
+	 * parallel.
+	 *
+	 * Unlike the index-only scan, every heap page is prefetched here (see
+	 * the IndexScanPrefetchNext callback below), because a plain index scan
+	 * always has to fetch the heap tuple.
 	 *
-	 * XXX No prefetching for direct I/O.
+	 * Prefetching may still end up disabled here (prefetch_max = 0), for
+	 * example for direct I/O, or when the plan does not allow it
+	 * (allow_prefetch = false, as for cursors that might change the scan
+	 * direction). See index_heap_prefetch_target for the details.
+	 *
+	 * XXX Maybe rename the object to "index reader" or something, given
+	 * that it is used even with prefetching disabled?
 	 */
-	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
-	{
-		int	prefetch_max;
-		Relation    heapRel = indexstate->ss.ss_currentRelation;
-
-		/*
-		 * Determine number of heap pages to prefetch for this index scan. This
-		 * is essentially just effective_io_concurrency for the table (or the
-		 * tablespace it's in).
-		 *
-		 * XXX Should this also look at plan.plan_rows and maybe cap the target
-		 * to that? Pointless to prefetch more than we expect to use. Or maybe
-		 * just reset to that value during prefetching, after reading the next
-		 * index page (or rather after rescan)?
-		 */
-		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
-						   indexstate->ss.ps.plan->plan_rows);
-
-		if (prefetch_max > 0)
-		{
-			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
-
-			prefetch->queueIndex = 0;
-			prefetch->queueStart = 0;
-			prefetch->queueEnd = 0;
+	prefetch_max = index_heap_prefetch_target(indexstate->ss.ss_currentRelation,
+											  indexstate->ss.ps.plan->plan_rows,
+											  node->allow_prefetch);
 
-			prefetch->prefetchTarget = 0;
-			prefetch->prefetchMaxTarget = prefetch_max;
-			prefetch->vmBuffer = InvalidBuffer;
-
-			indexstate->iss_prefetch = prefetch;
-		}
-	}
+	indexstate->iss_prefetch = IndexPrefetchAlloc(IndexScanPrefetchNext,
+												  prefetch_max,
+												  NULL);
 
 	/*
 	 * all done.
@@ -1795,3 +1799,26 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 					 node->iss_ScanKeys, node->iss_NumScanKeys,
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 }
+
+/*
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this?
+ */
+static IndexPrefetchEntry *
+IndexScanPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer			tid;
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch always */
+		entry->prefetch = true;
+	}
+
+	return entry;
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 34ca6d4ac21..0abbcd31ddd 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -184,13 +184,15 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 								 Oid indexid, List *indexqual, List *indexqualorig,
 								 List *indexorderby, List *indexorderbyorig,
 								 List *indexorderbyops,
-								 ScanDirection indexscandir);
+								 ScanDirection indexscandir,
+								 bool allow_prefetch);
 static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
 										 Index scanrelid, Oid indexid,
 										 List *indexqual, List *recheckqual,
 										 List *indexorderby,
 										 List *indextlist,
-										 ScanDirection indexscandir);
+										 ScanDirection indexscandir,
+										 bool allow_prefetch);
 static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
 											  List *indexqual,
 											  List *indexqualorig);
@@ -3161,6 +3163,13 @@ create_indexscan_plan(PlannerInfo *root,
 		}
 	}
 
+	/*
+	 * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
+	 * of a misuse of the flag, but we need to disable prefetching for cursors
+	 * (which might change direction), and parallelModeOK does that. But maybe
+	 * we might (or should) have a separate flag.
+	 * we should have a separate flag for this.
+
 	/* Finally ready to build the plan node */
 	if (indexonly)
 		scan_plan = (Scan *) make_indexonlyscan(tlist,
@@ -3171,7 +3180,8 @@ create_indexscan_plan(PlannerInfo *root,
 												stripped_indexquals,
 												fixed_indexorderbys,
 												indexinfo->indextlist,
-												best_path->indexscandir);
+												best_path->indexscandir,
+												root->glob->parallelModeOK);
 	else
 		scan_plan = (Scan *) make_indexscan(tlist,
 											qpqual,
@@ -3182,7 +3192,8 @@ create_indexscan_plan(PlannerInfo *root,
 											fixed_indexorderbys,
 											indexorderbys,
 											indexorderbyops,
-											best_path->indexscandir);
+											best_path->indexscandir,
+											root->glob->parallelModeOK);
 
 	copy_generic_path_info(&scan_plan->plan, &best_path->path);
 
@@ -5522,7 +5533,8 @@ make_indexscan(List *qptlist,
 			   List *indexorderby,
 			   List *indexorderbyorig,
 			   List *indexorderbyops,
-			   ScanDirection indexscandir)
+			   ScanDirection indexscandir,
+			   bool allow_prefetch)
 {
 	IndexScan  *node = makeNode(IndexScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5539,6 +5551,7 @@ make_indexscan(List *qptlist,
 	node->indexorderbyorig = indexorderbyorig;
 	node->indexorderbyops = indexorderbyops;
 	node->indexorderdir = indexscandir;
+	node->allow_prefetch = allow_prefetch;
 
 	return node;
 }
@@ -5552,7 +5565,8 @@ make_indexonlyscan(List *qptlist,
 				   List *recheckqual,
 				   List *indexorderby,
 				   List *indextlist,
-				   ScanDirection indexscandir)
+				   ScanDirection indexscandir,
+				   bool allow_prefetch)
 {
 	IndexOnlyScan *node = makeNode(IndexOnlyScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5568,6 +5582,7 @@ make_indexonlyscan(List *qptlist,
 	node->indexorderby = indexorderby;
 	node->indextlist = indextlist;
 	node->indexorderdir = indexscandir;
+	node->allow_prefetch = allow_prefetch;
 
 	return node;
 }
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index b5c79359425..e11d022827a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6297,7 +6297,7 @@ get_actual_variable_endpoint(Relation heapRel,
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir, NULL)) != NULL)
+	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
 	{
 		BlockNumber block = ItemPointerGetBlockNumber(tid);
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c0c46d7a05f..f3452e8a799 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -129,7 +129,7 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
-
+/* index prefetching - probably should be somewhere else, outside indexam */
 
 /*
  * Cache of recently prefetched blocks, organized as a hash table of
@@ -159,6 +159,8 @@ typedef struct IndexPrefetchCacheEntry {
 
 /*
  * Used to detect sequential patterns (and disable prefetching).
+ *
+ * XXX seems strange to have two separate values
  */
 #define		PREFETCH_QUEUE_HISTORY			8
 #define		PREFETCH_SEQ_PATTERN_BLOCKS		4
@@ -166,9 +168,38 @@ typedef struct IndexPrefetchCacheEntry {
 typedef struct IndexPrefetchEntry
 {
 	ItemPointerData		tid;
-	bool				all_visible;
+
+	/* should we prefetch heap page for this TID? */
+	bool				prefetch;
+
+	/*
+	 * If a callback is specified, it may store per-TID information. The
+	 * data has to be a single palloc-ed piece of data, so that it can
+	 * be easily pfreed.
+	 *
+	 * XXX We could relax this by providing another cleanup callback, but
+	 * that seems unnecessarily complex - we expect the information to be
+	 * very simple, like bool flags or something. Easy to do in a simple
+	 * struct, and perhaps even reuse without pfree/palloc.
+	 */
+	void			    *data;
 } IndexPrefetchEntry;
 
+/* needs to be before the IndexPrefetchNextCB typedef */
+typedef struct IndexPrefetch IndexPrefetch;
+
+/*
+ * custom callback, allowing the user code to determine which TID to read
+ *
+ * If there is no TID to prefetch, the return value is expected to be NULL.
+ *
+ * Otherwise the "tid" field is expected to contain the TID to prefetch, and
+ * "data" may be set to custom information the callback needs to pass outside.
+ */
+typedef IndexPrefetchEntry *(*IndexPrefetchNextCB) (IndexScanDesc scan,
+													IndexPrefetch *state,
+													ScanDirection direction);
+
 typedef struct IndexPrefetch
 {
 	/*
@@ -187,9 +218,18 @@ typedef struct IndexPrefetch
 	uint64		countSkipSequential;
 	uint64		countSkipCached;
 
-	/* used when prefetching index-only scans */
-	bool		indexonly;
-	Buffer		vmBuffer;
+	/*
+	 * If a callback is specified, it may store global state (for all TIDs).
+	 * For example, the VM buffer may be kept during IOS. This is similar to the
+	 * data field in IndexPrefetchEntry, but that's per-TID.
+	 */
+	void	   *data;
+
+	/*
+	 * Callback to customize the prefetch (decide which blocks need to be
+	 * prefetched, etc.)
+	 */
+	IndexPrefetchNextCB	next_cb;
 
 	/*
 	 * Queue of TIDs to prefetch.
@@ -224,14 +264,22 @@ typedef struct IndexPrefetch
 
 } IndexPrefetch;
 
+extern IndexPrefetch *IndexPrefetchAlloc(IndexPrefetchNextCB next_cb,
+								  int prefetch_max, void *data);
+
+extern IndexPrefetchEntry *IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *state, ScanDirection direction);
+
 #define PREFETCH_QUEUE_INDEX(a)	((a) % (MAX_IO_CONCURRENCY))
 #define PREFETCH_QUEUE_EMPTY(p)	((p)->queueEnd == (p)->queueIndex)
 #define PREFETCH_ENABLED(p)		((p) && ((p)->prefetchMaxTarget > 0))
 #define PREFETCH_FULL(p)		((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
 #define PREFETCH_DONE(p)		((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+
+/* XXX easy to confuse with PREFETCH_ENABLED */
 #define PREFETCH_ACTIVE(p)		(PREFETCH_ENABLED(p) && !(p)->prefetchDone)
 #define PREFETCH_BLOCK_INDEX(v)	((v) % PREFETCH_QUEUE_HISTORY)
 
+extern int index_heap_prefetch_target(Relation heapRel, double plan_rows, bool allow_prefetch);
 
 /*
  * generalized index_ interface routines (in indexam.c)
@@ -278,17 +326,11 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
-									 ScanDirection direction,
-									 IndexPrefetch *prefetch);
-extern ItemPointer index_getnext_tid_vm(IndexScanDesc scan,
-										ScanDirection direction,
-										IndexPrefetch *prefetch,
-										bool *all_visible);
+									 ScanDirection direction);
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   struct TupleTableSlot *slot,
-							   IndexPrefetch *prefetch);
+							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8745453a5b4..cc891d4fccf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1529,7 +1529,6 @@ typedef struct
 	bool	   *elem_nulls;		/* array of num_elems is-null flags */
 } IndexArrayKeyInfo;
 
-
 /* ----------------
  *	 IndexScanState information
  *
@@ -1582,6 +1581,7 @@ typedef struct IndexScanState
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
 
+	/* prefetching */
 	IndexPrefetch *iss_prefetch;
 } IndexScanState;
 
@@ -1621,6 +1621,8 @@ typedef struct IndexOnlyScanState
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
+
+	/* prefetching */
 	IndexPrefetch *ioss_prefetch;
 } IndexOnlyScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d40af8e59fe..bc1029982cf 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -454,6 +454,7 @@ typedef struct IndexScan
 	List	   *indexorderbyorig;	/* the same in original form */
 	List	   *indexorderbyops;	/* OIDs of sort ops for ORDER BY exprs */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	bool		allow_prefetch;	/* allow prefetching of heap pages */
 } IndexScan;
 
 /* ----------------
@@ -496,6 +497,7 @@ typedef struct IndexOnlyScan
 	List	   *indexorderby;	/* list of index ORDER BY exprs */
 	List	   *indextlist;		/* TargetEntry list describing index's cols */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	bool		allow_prefetch;	/* allow prefetching of heap pages */
 } IndexOnlyScan;
 
 /* ----------------
-- 
2.43.0
