Hi,

Here's an improved version of this patch, finishing a lot of the stuff
that I alluded to earlier - moving the code from indexam.c, renaming a
bunch of stuff, etc. I've also squashed it into a single patch, to make
it easier to review.

I'll briefly go through the main changes in the patch, and then will
respond in-line to Robert's points.


1) I moved the code from indexam.c to (new) execPrefetch.c. All the
prototypes / typedefs now live in executor.h, with only minimal changes
in execnodes.h (adding it to scan descriptors).

I believe this finally moves the code to the right place - it feels much
nicer and cleaner than in indexam.c.  And it allowed me to hide a bunch
of internal structs and improve the general API, I think.

I'm sure there's stuff that could be named differently, but the layering
feels about right, I think.


2) A bunch of stuff got renamed to start with IndexPrefetch... to make
the naming consistent / clearer. I'm not entirely sure IndexPrefetch is
the right name, though - it's still a bit misleading, as it might seem
it's about prefetching index stuff, but really it's about heap pages
from indexes. Maybe IndexScanPrefetch() or something like that?


3) If there's a way to make this work with the streaming I/O API, I'm
not aware of it. But the overall design seems somewhat similar (based on
"next" callback etc.) so hopefully that'd make it easier to adopt it.


4) I initially relied on parallelModeOK to disable prefetching, which
kinda worked, but not really. Robert suggested using the execute_once
flag directly, and I think that's much better - not only is it cleaner,
it also seems more appropriate (the parallel flag considers other stuff
that is not quite relevant to prefetching).

Thinking about this, I think it should be possible to make prefetching
work even for plans with execute_once=false. In particular, when the
plan changes direction it should be possible to simply "walk back" the
prefetch queue, to get to the "correct" place in the scan. But I'm not
sure it's worth it, because plans that change direction often can't
really benefit from prefetches anyway - they tend to revisit stuff they
accessed shortly before. For plans that don't change direction but may
pause, we don't know if the plan pauses long enough for the prefetched
pages to get evicted or something. So I think it's OK that
execute_once=false means no prefetching.


5) I haven't done anything about the xs_heap_continue=true case yet.


6) I went through all the comments and reworked them considerably.
There's a main comment at the start of execPrefetch.c, describing the
overall design, and then there are comments for each function,
explaining that bit in more detail. Or at least that's the goal -
there's still work to do.

There are two trivial FIXMEs, but you can ignore those - it's not that
there's a bug, but that I'd like to rework something and just don't know
how yet.

There are also a couple of XXX comments. Some are rather wild ideas for
the future, others are "open questions" to be discussed during a
review.

Anyway, there should be no outright obsolete comments - if there's
something I missed, let me know.


Now to Robert's message ...


On 1/9/24 21:31, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
>> Here's a somewhat reworked version of the patch. My initial goal was to
>> see if it could adopt the StreamingRead API proposed in [1], but that
>> turned out to be less straight-forward than I hoped, for two reasons:
> 
> I guess we need Thomas or Andres or maybe Melanie to comment on this.
> 

Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
streaming I/O stuff.

>> Perhaps a bigger change is that I decided to move this into a separate
>> API on top of indexam.c. The original idea was to integrate this into
>> index_getnext_tid/index_getnext_slot, so that all callers benefit from
>> the prefetching automatically. Which would be nice, but it also meant
>> it's need to happen in the indexam.c code, which seemed dirty.
> 
> This patch is hard to review right now because there's a bunch of
> comment updating that doesn't seem to have been done for the new
> design. For instance:
> 
> + * XXX This does not support prefetching of heap pages. When such
> prefetching is
> + * desirable, use index_getnext_tid().
> 
> But not any more.
> 

True. And this is now even more obsolete, as the prefetching was moved
from the indexam.c layer to the executor.

> + * XXX The prefetching may interfere with the patch allowing us to evaluate
> + * conditions on the index tuple, in which case we may not need the heap
> + * tuple. Maybe if there's such filter, we should prefetch only pages that
> + * are not all-visible (and the same idea would also work for IOS), but
> + * it also makes the indexing a bit "aware" of the visibility stuff (which
> + * seems a somewhat wrong). Also, maybe we should consider the filter
> selectivity
> 
> I'm not sure whether all the problems in this area are solved, but I
> think you've solved enough of them that this at least needs rewording,
> if not removing.
> 
> +     * XXX Comment/check seems obsolete.
> 
> This occurs in two places. I'm not sure if it's accurate or not.
> 
> +     * XXX Could this be an issue for the prefetching? What if we
> prefetch something
> +     * but the direction changes before we get to the read? If that
> could happen,
> +     * maybe we should discard the prefetched data and go back? But can we 
> even
> +     * do that, if we already fetched some TIDs from the index? I don't think
> +     * indexorderdir can't change, but es_direction maybe can?
> 
> But your email claims that "The patch simply disables prefetching for
> such queries, using the same logic that we do for parallelism." FWIW,
> I think that's a fine way to handle that case.
> 

True. I left this comment behind partly intentionally, to point out why
we disable the prefetching in these cases, but you're right that the
comment now explains something that can't happen.

> +     * XXX Maybe we should enable prefetching, but prefetch only pages that
> +     * are not all-visible (but checking that from the index code seems like
> +     * a violation of layering etc).
> 
> Isn't this fixed now? Note this comment occurs twice.
> 
> +     * XXX We need to disable this in some cases (e.g. when using index-only
> +     * scans, we don't want to prefetch pages). Or maybe we should prefetch
> +     * only pages that are not all-visible, that'd be even better.
> 
> Here again.
> 

Sorry, you're right those comments (and a couple more nearby) were
stale. Removed / clarified.

> And now for some comments on other parts of the patch, mostly other
> XXX comments:
> 
> + * XXX This does not support prefetching of heap pages. When such
> prefetching is
> + * desirable, use index_getnext_tid().
> 
> There's probably no reason to write XXX here. The comment is fine.
> 
> +     * XXX Notice we haven't added the block to the block queue yet, and 
> there
> +     * is a preceding block (i.e. blockIndex-1 is valid).
> 
> Same here, possibly? If this XXX indicates a defect in the code, I
> don't know what the defect is, so I guess it needs to be more clear.
> If it is just explaining the code, then there's no reason for the
> comment to say XXX.
> 

Yeah, removed the XXX / reworded a bit.

> +     * XXX Could it be harmful that we read the queue backwards? Maybe memory
> +     * prefetching works better for the forward direction?
> 
> It does. But I don't know whether that matters here or not.
> 
> +             * XXX We do add the cache size to the request in order not to
> +             * have issues with uint64 underflows.
> 
> I don't know what this means.
> 

There's a check that does this:

      (x + PREFETCH_CACHE_SIZE) >= y

it might also be done as "mathematically equivalent"

      x >= (y - PREFETCH_CACHE_SIZE)

but if the "y" is a uint64 and its value is smaller than the constant,
the subtraction would underflow. It'd eventually stop mattering, once
the "y" gets large enough, ofc.
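
To make that a bit more concrete, here's a tiny standalone example (not
from the patch) demonstrating how the two forms diverge while the
counter is still small:

  #include <stdint.h>
  #include <stdio.h>

  #define PREFETCH_CACHE_SIZE 1024

  int
  main(void)
  {
      uint64_t x = 10;    /* request number stored in the cache entry */
      uint64_t y = 100;   /* current value of the prefetch counter */

      /* form used in the patch - no subtraction, so no underflow */
      printf("recently prefetched: %d\n", (x + PREFETCH_CACHE_SIZE) >= y);

      /*
       * "equivalent" form - while y < PREFETCH_CACHE_SIZE the subtraction
       * wraps around to a huge value, so the comparison gives the wrong
       * answer early in the scan
       */
      printf("recently prefetched: %d\n", x >= (y - PREFETCH_CACHE_SIZE));

      return 0;
  }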

> + * XXX not sure this correctly handles xs_heap_continue - see
> index_getnext_slot,
> + * maybe nodeIndexscan needs to do something more to handle this?
> Although, that
> + * should be in the indexscan next_cb callback, probably.
> + *
> + * XXX If xs_heap_continue=true, we need to return the last TID.
> 
> You've got a bunch of comments about xs_heap_continue here -- and I
> don't fully understand what the issues are here with respect to this
> particular patch, but I think that the general purpose of
> xs_heap_continue is to handle the case where we need to return more
> than one tuple from the same HOT chain. With an MVCC snapshot that
> doesn't happen, but with say SnapshotAny or SnapshotDirty, it could.
> As far as possible, the prefetcher shouldn't be involved at all when
> xs_heap_continue is set, I believe, because in that case we're just
> returning a bunch of tuples from the same page, and the extra fetches
> from that heap page shouldn't trigger or require any further
> prefetching.
> 

Yes, that's correct. The current code simply ignores that flag and just
proceeds to the next TID, which is correct for xs_heap_continue=false,
and thus all MVCC snapshots work fine. But for the Any/Dirty case it
needs to work a bit differently.
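
To illustrate the direction I have in mind - this is just a hypothetical
sketch, not part of the patch, and the lastEntry field does not exist
yet - the queue read would have to special-case the flag roughly like
this:

  /* hypothetical: at the top of IndexPrefetchNextEntry() */
  if (scan->xs_heap_continue)
  {
      /*
       * Still returning tuples from the same HOT chain - return the last
       * TID again instead of advancing the queue, and don't issue any
       * prefetch (we're already fetching from that heap page).
       */
      return &prefetch->lastEntry;
  }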

> +     * XXX Should this also look at plan.plan_rows and maybe cap the target
> +     * to that? Pointless to prefetch more than we expect to use. Or maybe
> +     * just reset to that value during prefetching, after reading the next
> +     * index page (or rather after rescan)?
> 
> It seems questionable to use plan_rows here because (1) I don't think
> we have existing cases where we use the estimated row count in the
> executor for anything, we just carry it through so EXPLAIN can print
> it and (2) row count estimates can be really far off, especially if
> we're on the inner side of a nested loop, we might like to figure that
> out eventually instead of just DTWT forever. But on the other hand
> this does feel like an important case where we have a clue that
> prefetching might need to be done less aggressively or not at all, and
> it doesn't seem right to ignore that signal either. I wonder if we
> want this shaped in some other way, like a Boolean that says
> are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
> side of a semi-join or anti-join.
> 

The current code actually does look at plan_rows when calculating the
prefetch target:

  prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
                                            node->ss.ps.plan->plan_rows,
                                            estate->es_use_prefetching);

but I agree maybe it should not, for the reasons you explain. I'm not
attached to this part.


> +     * We reach here if the index only scan is not parallel, or if we're
> +     * serially executing an index only scan that was planned to be
> +     * parallel.
> 
> Well, this seems sad.
> 

Stale comment, I believe. However, I didn't see much benefit with
parallel index scans during testing. Having I/O issued from multiple
workers generally had the same effect, I think.

> +     * XXX This might lead to IOS being slower than plain index scan, if the
> +     * table has a lot of pages that need recheck.
> 
> How?
> 

The comment is not particularly clear about what "this" means, but I
believe it was about an index-only scan with many not-all-visible pages.
If IOS didn't do prefetching, a regular index scan with prefetching
might be way faster. But the code actually allows doing prefetching even
for IOS, by checking the VM in the "next" callback.
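
For reference, this is the gist of the IndexOnlyPrefetchNext callback
from the patch (condensed):

  /* condensed from IndexOnlyPrefetchNext() in the attached patch */
  all_visible = VM_ALL_VISIBLE(scan->heapRelation, blkno, (Buffer *) data);

  entry->tid = *tid;
  entry->prefetch = !all_visible;      /* prefetch only if not all-visible */

  entry->data = palloc(sizeof(bool));  /* remember the result for the node */
  *(bool *) entry->data = all_visible;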

> +    /*
> +     * XXX Only allow index prefetching when parallelModeOK=true. This is a 
> bit
> +     * of a misuse of the flag, but we need to disable prefetching for 
> cursors
> +     * (which might change direction), and parallelModeOK does that. But 
> maybe
> +     * we might (or should) have a separate flag.
> +     */
> 
> I think the correct flag to be using here is execute_once, which
> captures whether the executor could potentially be invoked a second
> time for the same portal. Changes in the fetch direction are possible
> if and only if !execute_once.
> 

Right. The new patch version does that.

>> Note 1: The IndexPrefetch name is a bit misleading, because it's used
>> even with prefetching disabled - all index reads from the index scan
>> happen through it. Maybe it should be called IndexReader or something
>> like that.
> 
> My biggest gripe here is the capitalization. This version adds, inter
> alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
> index_heap_prefetch_target, which seems like one or two too many
> conventions. But maybe the PREFETCH_* macros don't even belong in a
> public header.
> 
> I do like the index_heap_prefetch_* naming. Possibly that's too
> verbose to use for everything, but calling this index-heap-prefetch
> rather than index-prefetch seems clearer.
> 

Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
to keep it consistent. And then the constants are all capital, ofc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From a3f99cc0aaa64ef94b09fc0a58bee709cd29add9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Fri, 17 Nov 2023 23:54:19 +0100
Subject: [PATCH v20240112] Prefetch heap pages during index scans

Index scans are a significant source of random I/O on the indexed heap,
but can't benefit from kernel read-ahead. For bitmap scans that is not
an issue, because they do prefetch explicitly, but for plain index scans
this is a major bottleneck - reading a page at a time does not allow
saturating modern storage systems.

This enhances index scans (including index-only scans) to prefetch heap
pages. The scan maintains a queue of future TIDs received from the index,
prefetches the associated heap pages, and then eventually passes the TIDs
to the caller.

To eliminate unnecessary prefetches, a small cache of recent prefetches
is maintained, and prefetches of recently prefetched blocks are skipped.
Furthermore, sequential access patterns are detected and not prefetched,
on the assumption that the kernel read-ahead will handle them more
efficiently.

These optimizations are best-effort heuristics - we don't know if the
kernel will actually prefetch the pages on its own, and we can't easily
check that. Moreover, different kernels (and kernel versions) may behave
differently.

Note: For shared buffers we can easily check if a page is cached, and
the PrefetchBuffer() function already takes care of that. These
optimizations are primarily about the page cache.

The prefetching is also disabled for plans that are not guaranteed to be
executed exactly once - such plans may change scan direction, interfering
with the prefetch queue. Consider scrollable cursors with backwards
scans. This might get improved to allow the prefetcher to handle
direction changes, but it's not clear if it's worth it.

Note: If a plan changes the scan direction, that inherently wastes the
issued prefetches. If the direction changes often, it likely means a lot
of the pages are still cached. Similarly, if a plan pauses for a long
time, the already prefetched pages may get evicted.
---
 src/backend/commands/explain.c           |  18 +
 src/backend/executor/Makefile            |   1 +
 src/backend/executor/execMain.c          |  12 +
 src/backend/executor/execPrefetch.c      | 884 +++++++++++++++++++++++
 src/backend/executor/instrument.c        |   4 +
 src/backend/executor/nodeIndexonlyscan.c | 113 ++-
 src/backend/executor/nodeIndexscan.c     |  68 +-
 src/include/executor/executor.h          |  52 ++
 src/include/executor/instrument.h        |   2 +
 src/include/nodes/execnodes.h            |  10 +
 src/tools/pgindent/typedefs.list         |   3 +
 11 files changed, 1162 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/executor/execPrefetch.c

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 3d590a6b9f5..9bbe270ab7d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3568,6 +3568,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 										!INSTR_TIME_IS_ZERO(usage->local_blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
 												  has_local || has_temp ||
 												  has_shared_timing ||
@@ -3679,6 +3680,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			appendStringInfoChar(es->str, '\n');
 		}
 
+		/* As above, show only positive counter values. */
+		if (has_prefetches)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+				appendStringInfo(es->str, " blocks=%lld",
+								 (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+				appendStringInfo(es->str, " rounds=%lld",
+								 (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..840f5a6596a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,6 +24,7 @@ OBJS = \
 	execMain.o \
 	execParallel.o \
 	execPartition.o \
+	execPrefetch.o \
 	execProcnode.o \
 	execReplication.o \
 	execSRF.o \
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 13a9b7da83b..e3e9131bd62 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1645,6 +1645,18 @@ ExecutePlan(EState *estate,
 	 */
 	estate->es_direction = direction;
 
+	/*
+	 * Enable prefetching only if the plan is executed exactly once. We need
+	 * to disable prefetching for cases when the scan direction may change
+	 * (e.g. for scrollable cursors).
+	 *
+	 * XXX It might be possible to improve the prefetching code to handle this
+	 * by "walking back" the TID queue, but it's not clear if it's worth it.
+	 * And if there are pauses in between the fetches, the prefetched pages may
+	 * get evicted, wasting the prefetch effort.
+	 */
+	estate->es_use_prefetching = execute_once;
+
 	/*
 	 * If the plan might potentially be executed multiple times, we must force
 	 * it to run without parallelism, because we might exit early.
diff --git a/src/backend/executor/execPrefetch.c b/src/backend/executor/execPrefetch.c
new file mode 100644
index 00000000000..000bb796d51
--- /dev/null
+++ b/src/backend/executor/execPrefetch.c
@@ -0,0 +1,884 @@
+/*-------------------------------------------------------------------------
+ *
+ * execPrefetch.c
+ *	  routines for prefetching heap pages for index scans.
+ *
+ * The IndexPrefetch node represents an "index prefetcher" which reads TIDs
+ * from an index scan, and prefetches the referenced heap pages. The basic
+ * API consists of these methods:
+ *
+ *	IndexPrefetchAlloc - allocate IndexPrefetch with custom callbacks
+ *	IndexPrefetchNext - read next TID from the index scan, do prefetches
+ *	IndexPrefetchReset - reset state of the prefetcher (for rescans)
+ *	IndexPrefetchEnd - release resources held by the prefetcher
+ *
+ * When allocating a prefetcher, the caller can supply two custom callbacks:
+ *
+ *	IndexPrefetchNextCB - reads the next TID from the index scan (required)
+ *	IndexPrefetchCleanupCB - release private prefetch data (optional)
+ *
+ * These callbacks allow customizing the behavior for different types of
+ * index scans - for example, index-only scans may inspect the visibility map,
+ * and adjust prefetches based on that.
+ *
+ *
+ * TID queue
+ * ---------
+ * The prefetcher maintains a simple queue of TIDs fetched from the index.
+ * The length of the queue (number of TIDs) is determined by the prefetch
+ * target, i.e. effective_io_concurrency. Adding entries to the queue is
+ * the responsibility of IndexPrefetchFillQueue(), depending on the state
+ * of the scan etc. It also prefetches the pages, if appropriate.
+ *
+ * Note: This prefetching applies only to heap pages from the indexed
+ * relation, not the internal index pages.
+ *
+ *
+ * pattern detection
+ * -----------------
+ * For certain access patterns, prefetching is inefficient. In particular,
+ * this applies to sequential access (where kernel read-ahead works fine)
+ * and for pages that are already in memory (prefetched recently). The
+ * prefetcher attempts to identify these two cases - sequential patterns
+ * are detected by IndexPrefetchBlockIsSequential, using a tiny queue of
+ * recently prefetched blocks. Recently prefetched blocks are tracked in
+ * a "partitioned" LRU cache.
+ *
+ * Note: These are inherently best-effort heuristics. We don't know what
+ * the kernel algorithm/configuration is, or more precisely what already
+ * is in page cache.
+ *
+ *
+ * cache of recent prefetches
+ * --------------------------
+ * Cache of recently prefetched blocks, organized as a hash table of small
+ * LRU caches. It doesn't need to be perfectly accurate, but we aim to make
+ * false positives/negatives reasonably low. For more details see the
+ * comments at IndexPrefetchIsCached.
+ *
+ *
+ * prefetch request number
+ * -----------------------
+ * Prefetching works with the concept of "age" (e.g. "recently prefetched
+ * pages"). This relies on a simple prefetch counter, incremented every
+ * time a prefetch is issued. This is not exactly the same thing as time,
+ * as there may be arbitrary delays, but it's good enough for this purpose.
+ *
+ *
+ * auto-tuning / self-adjustment
+ * -----------------------------
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of rescans and number of
+ * items (TIDs) actually returned from the scan. Then we could calculate
+ * rows / rescan and adjust the prefetch target accordingly. That'd help
+ * with cases when a scan matches only very few rows, far less than the
+ * prefetchTarget, because the unnecessary prefetches are wasted I/O.
+ * Imagine a LIMIT on top of index scan, or something like that.
+ *
+ * XXX Could we tune the cache size based on execution statistics? We have
+ * a cache of limited size (PREFETCH_CACHE_SIZE = 1024 by default), but
+ * how do we know it's the right size? Ideally, we'd have a cache large
+ * enough to track actually cached blocks. If the OS caches 10240 pages,
+ * then we may do 90% of prefetch requests unnecessarily. Or maybe there's
+ * a lot of contention, blocks are evicted quickly, and 90% of the blocks
+ * in the cache are not actually cached anymore? But we do have a concept
+ * of sequential request ID (PrefetchCacheEntry->request), which gives us
+ * information about "age" of the last prefetch. Now it's used only when
+ * evicting entries (to keep the more recent one), but maybe we could also
+ * use it when deciding if the page is cached. Right now any block that's
+ * in the cache is considered cached and not prefetched, but maybe we could
+ * have "max age", and tune it based on feedback from reading the blocks
+ * later. For example, if we find the block in cache and decide not to
+ * prefetch it, but then later find we have to do I/O, it means our cache
+ * is too large. And we could "reduce" the maximum age (measured from the
+ * current prefetchRequest value), so that only more recent blocks would
+ * be considered cached. Not sure about the opposite direction, where we
+ * decide to prefetch a block - AFAIK we don't have a way to determine if
+ * I/O was needed or not in this case (so we can't increase the max age).
+ * But maybe we could do that somehow speculatively, i.e. increase the
+ * value once in a while, and see what happens.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execPrefetch.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "catalog/index.h"
+#include "common/hashfn.h"
+#include "executor/executor.h"
+#include "nodes/nodeFuncs.h"
+#include "storage/bufmgr.h"
+#include "utils/spccache.h"
+
+
+/*
+ * An entry representing a recently prefetched block. For each block we know
+ * the request number, assigned sequentially, allowing us to decide how old
+ * the request is.
+ *
+ * XXX Is it enough to keep the request as uint32? This way we can prefetch
+ * 32TB of data, and this allows us to fit the whole entry into 64B, i.e.
+ * one cacheline. Which seems like a good thing.
+ *
+ * XXX If we're extra careful / paranoid about uint32, we could reset the
+ * cache once the request wraps around.
+ */
+typedef struct IndexPrefetchCacheEntry
+{
+	BlockNumber block;
+	uint32		request;
+} IndexPrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too small or
+ * too large. 1024 entries seems about right, it covers ~8MB of data. This is
+ * rather arbitrary - there's no formula that'd tell us what the optimal size
+ * is, and we can't even tune it based on runtime (as it depends on what the
+ * other backends do too).
+ *
+ * A value too small would mean we may issue unnecessary prefetches for pages
+ * that have already been prefetched recently (and are still in page cache),
+ * incurring costs for unnecessary fadvise() calls.
+ *
+ * A value too large would mean we do not issue prefetches for pages that have
+ * already been evicted from memory (both shared buffers and page cache).
+ *
+ * Note however that PrefetchBuffer() checks shared buffers before doing the
+ * fadvise call, which somewhat limits the risk of a small cache - the page
+ * would have to get evicted from shared buffers not yet from page cache.
+ * Also, the cost of not issuing a fadvise call (and doing synchronous I/O
+ * later) is much higher than the unnecessary fadvise call. For these reasons
+ * it's better to keep the cache fairly small.
+ *
+ * The cache is structured as an array of small LRU caches - you may also
+ * imagine it as a hash table of LRU caches. To remember a prefetched block,
+ * the block number is mapped to an LRU by hashing. And then in each LRU
+ * we organize the entries by age (per request number) - in particular, the
+ * age determines which entry gets evicted after the LRU gets full.
+ *
+ * The LRU needs to be small enough to be searched linearly. At the same
+ * time it needs to be sufficiently large to handle collisions when several
+ * hot blocks get mapped to the same LRU. For example, if the LRU was only
+ * a single entry, and there were two hot blocks mapped to it, that would
+ * often give incorrect answers.
+ *
+ * The 8 entries per LRU seems about right - it's small enough for linear
+ * search to work well, but large enough to be adaptive. It's not very
+ * likely for 9+ busy blocks (out of 1000 recent requests) to map to the
+ * same LRU, assuming a reasonable hash function.
+ *
+ * XXX Maybe we could consider effective_cache_size when sizing the cache?
+ * Not to size the cache for that, ofc, but maybe as a guidance of how many
+ * heap pages it might keep. Maybe just a fraction of the value,
+ * say Max(8MB, effective_cache_size / max_connections) or something.
+ */
+#define		PREFETCH_LRU_SIZE		8	/* slots in one LRU */
+#define		PREFETCH_LRU_COUNT		128 /* number of LRUs */
+#define		PREFETCH_CACHE_SIZE		(PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Size of small sequential queue of most recently prefetched blocks, used
+ * to check if the block is exactly the same as the immediately preceding
+ * one (in which case prefetching is not needed), and if the blocks are a
+ * sequential pattern (in which case the kernel read-ahead is likely going
+ * to be more efficient, and we don't want to interfere with it).
+ */
+#define		PREFETCH_QUEUE_HISTORY	8
+
+/*
+ * An index prefetcher, which maintains a queue of TIDs from an index, and
+ * issues prefetches (if deemed beneficial and supported by the OS).
+ */
+typedef struct IndexPrefetch
+{
+	int			prefetchTarget; /* how far we should be prefetching */
+	int			prefetchMaxTarget;	/* maximum prefetching distance */
+	int			prefetchReset;	/* reset to this distance on rescan */
+	bool		prefetchDone;	/* did we get all TIDs from the index? */
+
+	/* runtime statistics, displayed in EXPLAIN etc. */
+	uint32		countAll;		/* all prefetch requests (including skipped) */
+	uint32		countPrefetch;	/* PrefetchBuffer calls */
+	uint32		countSkipSequential;	/* skipped as sequential pattern */
+	uint32		countSkipCached;	/* skipped as recently prefetched */
+
+	/*
+	 * Queue of TIDs to prefetch.
+	 *
+	 * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+	 * than dynamically adjusting for custom values. However, 1000 entries
+	 * means ~16kB, which means an oversized chunk, and thus always a malloc()
+	 * call. However, we already have the prefetchCache, which is also large
+	 * enough to cause this :-(
+	 *
+	 * XXX However what about the case without prefetching? In that case it
+	 * would be nice to lower the malloc overhead, maybe?
+	 */
+	IndexPrefetchEntry queueItems[MAX_IO_CONCURRENCY];
+	uint32		queueIndex;		/* next TID to prefetch */
+	uint32		queueStart;		/* first valid TID in queue */
+	uint32		queueEnd;		/* first invalid (empty) TID in queue */
+
+	/*
+	 * The last couple of prefetched blocks, used to check for certain access
+	 * patterns and skip prefetching (e.g. for sequential access).
+	 *
+	 * XXX Separate from the main queue, because we only want to compare the
+	 * block numbers, not the whole TID. In sequential access it's likely we
+	 * read many items from each page, and we don't want to check many items
+	 * (as that is much more expensive).
+	 */
+	BlockNumber blockItems[PREFETCH_QUEUE_HISTORY];
+	uint32		blockIndex;		/* index in the block (points to the first
+								 * empty entry) */
+
+	/*
+	 * Cache of recently prefetched blocks, organized as a hash table of small
+	 * LRU caches.
+	 */
+	uint32		prefetchRequest;
+	IndexPrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
+
+	/*
+	 * Callback to customize the prefetch (decide which blocks need to be
+	 * prefetched, etc.)
+	 */
+	IndexPrefetchNextCB next_cb;	/* read next TID */
+	IndexPrefetchCleanupCB cleanup_cb;	/* cleanup data */
+
+	/*
+	 * If a callback is specified, it may store global state (for all TIDs).
+	 * For example VM buffer may be kept during IOS. This is similar to the
+	 * data field in IndexPrefetchEntry, but that's per-TID.
+	 */
+	void	   *data;
+} IndexPrefetch;
+
+/* small sequential queue of recent blocks */
+#define PREFETCH_BLOCK_INDEX(v)	((v) % PREFETCH_QUEUE_HISTORY)
+
+/* access to the main hybrid cache (hash of LRUs) */
+#define PREFETCH_LRU_ENTRY(p, lru, idx)	\
+	&((p)->prefetchCache[(lru) * PREFETCH_LRU_SIZE + (idx)])
+
+/* access to queue of TIDs (up to MAX_IO_CONCURRENCY elements) */
+#define PREFETCH_QUEUE_INDEX(a)	((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p)	((p)->queueEnd == (p)->queueIndex)
+
+/*
+ * macros to deal with prefetcher state
+ *
+ * FIXME may need rethinking, easy to confuse PREFETCH_ENABLED/PREFETCH_ACTIVE
+ */
+#define PREFETCH_ENABLED(p)		((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_QUEUE_FULL(p)		((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p)		((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p)		(PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+
+
+/*
+ * IndexPrefetchBlockIsSequential
+ *		Track the block number and check if the I/O pattern is sequential,
+ *		or if the block is the same as the immediately preceding one.
+ *
+ * This also updates the small sequential cache of blocks.
+ *
+ * The prefetching overhead is fairly low, but for some access patterns the
+ * benefits are small compared to the extra overhead, or the prefetching may
+ * even be harmful. In particular, for sequential access the read-ahead
+ * performed by the OS is very effective/efficient and our prefetching may
+ * be pointless or (worse) even interfere with it.
+ *
+ * This identifies simple sequential patterns, using a tiny queue of recently
+ * prefetched block numbers (PREFETCH_QUEUE_HISTORY blocks). It also checks
+ * if the block is exactly the same as any of the blocks in the queue (the
+ * main cache has the block too, but checking the tiny cache is likely cheaper).
+ *
+ * The main prefetch queue is not really useful for this, as it stores
+ * full TIDs, while we only care about block numbers. Consider a nicely
+ * clustered table, with a perfectly sequential pattern when accessed through
+ * an index. Each heap page may have dozens of TIDs, filling the prefetch
+ * queue. But we need to compare block numbers - those may either not be
+ * in the queue anymore, or we have to walk many TIDs (making it expensive,
+ * and we're in hot path).
+ *
+ * So a tiny queue of just block numbers seems like a better option.
+ *
+ * Returns true if the block is in a sequential pattern or was prefetched
+ * recently (and so should not be prefetched this time), or false (in which
+ * case it should be prefetched).
+ */
+static bool
+IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block)
+{
+	int			idx;
+
+	/*
+	 * If the block queue is empty, just store the block and we're done (it's
+	 * neither a sequential pattern nor a recently prefetched block).
+	 */
+	if (prefetch->blockIndex == 0)
+	{
+		prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+		prefetch->blockIndex++;
+		return false;
+	}
+
+	/*
+	 * Check if it's the same as the immediately preceding block. We don't
+	 * want to prefetch the same block over and over (which would happen for
+	 * well correlated indexes).
+	 *
+	 * In principle we could rely on IndexPrefetchIsCached doing this using
+	 * the full cache, but this check is much cheaper and we need to look at
+	 * the preceding block anyway, so we just do it.
+	 *
+	 * Notice we haven't added the block to the block queue yet, and there
+	 * is a preceding block (i.e. blockIndex-1 is valid).
+	 */
+	if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+		return true;
+
+	/*
+	 * Add the block number to the small queue.
+	 *
+	 * Done before checking if the pattern is sequential, because we want to
+	 * know about the block later, even if we end up skipping the prefetch.
+	 * Otherwise we'd not be able to detect longer sequential patterns - we'd
+	 * skip one block and then fail to skip the next couple blocks even in a
+	 * perfectly sequential pattern. And this oscillation might even prevent
+	 * the OS read-ahead from kicking in.
+	 */
+	prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+	prefetch->blockIndex++;
+
+	/*
+	 * Are there enough requests to confirm a sequential pattern? We only
+	 * consider something to be sequential after finding a sequence of
+	 * PREFETCH_QUEUE_HISTORY blocks.
+	 */
+	if (prefetch->blockIndex < PREFETCH_QUEUE_HISTORY)
+		return false;
+
+	/*
+	 * Check if the last couple blocks are in a sequential pattern. We look
+	 * for a sequential pattern of PREFETCH_QUEUE_HISTORY (8 by default), so
+	 * we look for patterns of 8 pages (64kB) including the new block.
+	 *
+	 * XXX Could it be harmful that we read the queue backwards? Maybe memory
+	 * prefetching works better for the forward direction?
+	 */
+	for (int i = 1; i < PREFETCH_QUEUE_HISTORY; i++)
+	{
+		/*
+		 * Calculate index of the earlier block (we need to do -1 as we
+		 * already incremented the index after adding the new block to the
+		 * queue). So (blockIndex-1) is the new block.
+		 */
+		idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+		/*
+		 * For a sequential pattern, the block "k" steps back needs to have a
+		 * block number smaller by "k" than the current block.
+		 */
+		if (prefetch->blockItems[idx] != (block - i))
+			return false;
+
+		/* Don't prefetch if the block happens to be the same. */
+		if (prefetch->blockItems[idx] == block)
+			return false;
+	}
+
+	/* sequential pattern detected - no need to prefetch */
+	return true;
+}
+
+/*
+ * IndexPrefetchIsCached
+ *		Check if the block was prefetched recently, and update the cache.
+ *
+ * We don't want to prefetch blocks that we already prefetched recently. It's
+ * cheap but not free, and the overhead may be quite significant.
+ *
+ * We want to remember which blocks were prefetched recently, so that we can
+ * skip repeated prefetches. We also need to eventually forget these blocks
+ * as they may get evicted from memory (particularly page cache, which is
+ * outside our control).
+ *
+ * A simple queue is not a viable option - it would allow expiring requests
+ * based on age, but it's very expensive to check (as it requires linear
+ * search, and we need a fairly large number of entries). A hash table does
+ * not work because it does not allow expiring entries by age.
+ *
+ * The cache does not need to be perfect - false positives/negatives are
+ * both acceptable, as long as the rate is reasonably low.
+ *
+ * We use a hybrid cache that is organized as many small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table of LRUs). The LRU caches are tiny (e.g. 8 entries), and the
+ * expiration happens at the level of a single LRU (using age determined
+ * by sequential request number).
+ *
+ * This allows quick searches and expiration, with false negatives (when a
+ * particular LRU has too many collisions with hot blocks, we may end up
+ * evicting entries that are more recent than some other LRU).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * requests in total (these are the default parameters), representing about
+ * 8MB of data.
+ *
+ * If we want to check if a block was recently prefetched, we calculate
+ * (hash(blkno) % 128) and search only the LRU at this index, using a linear
+ * search. If we want to add the block to the cache, we find either an
+ * empty slot or the "oldest" entry in the LRU, and store the block in it.
+ * If the block is already in the LRU, we only update the request number.
+ *
+ * The request age is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint32, so it should
+ * not wrap (we'd have to prefetch 32TB).
+ *
+ * If the request is no more than PREFETCH_CACHE_SIZE requests old, it's
+ * considered "recently prefetched". That is, the maximum age is the same
+ * as the total capacity of the cache.
+ *
+ * Returns true if the block was recently prefetched (and thus we don't
+ * need to prefetch it again), or false (should do a prefetch).
+ *
+ * XXX It's a bit confusing these return values are inverse compared to
+ * what IndexPrefetchBlockIsSequential does.
+ *
+ * XXX Should we increase the prefetch counter even if we determine the
+ * entry was recently prefetched? Then we might skip some request numbers
+ * (there'd be no entry with them).
+ */
+static bool
+IndexPrefetchIsCached(IndexPrefetch *prefetch, BlockNumber block)
+{
+	IndexPrefetchCacheEntry *entry;
+
+	/* map the block number to the LRU */
+	int			lru;
+
+	/* age/index of the oldest entry in the LRU, to maybe use */
+	uint64		oldestRequest = PG_UINT64_MAX;
+	int			oldestIndex = -1;
+
+	/*
+	 * First add the block to the (tiny) queue and see if it's part of a
+	 * sequential pattern. In this case we just ignore the block and don't
+	 * prefetch it - we expect OS read-ahead to do a better job.
+	 *
+	 * XXX Maybe we should still add the block to the main cache, in case we
+	 * happen to access it later. That might help if we happen to scan a lot
+	 * of the table sequentially, and then randomly. Not sure that's very
+	 * likely with index access, though.
+	 */
+	if (IndexPrefetchBlockIsSequential(prefetch, block))
+	{
+		prefetch->countSkipSequential++;
+		return true;
+	}
+
+	/* Which LRU does this block belong to? */
+	lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+	/*
+	 * Did we prefetch this block recently? Scan the LRU linearly, and while
+	 * doing that, track the oldest (or empty) entry, so that we know where to
+	 * put the block if we don't find a match.
+	 */
+	for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+	{
+		entry = PREFETCH_LRU_ENTRY(prefetch, lru, i);
+
+		/*
+		 * Is this the oldest prefetch request in this LRU?
+		 *
+		 * Notice that request is uint32, so an empty entry (with request=0)
+		 * is automatically the oldest one.
+		 */
+		if (entry->request < oldestRequest)
+		{
+			oldestRequest = entry->request;
+			oldestIndex = i;
+		}
+
+		/* Skip unused entries. */
+		if (entry->request == 0)
+			continue;
+
+		/* Is this entry for the same block as the current request? */
+		if (entry->block == block)
+		{
+			bool		prefetched;
+
+			/*
+			 * Is the old request sufficiently recent? If yes, we treat the
+			 * block as already prefetched. We need to check before updating
+			 * the prefetch request.
+			 *
+			 * XXX We do add the cache size to the request in order not to
+			 * have issues with underflows.
+			 */
+			prefetched = ((entry->request + PREFETCH_CACHE_SIZE) >= prefetch->prefetchRequest);
+
+			prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+			/* Update the request number. */
+			entry->request = ++prefetch->prefetchRequest;
+
+			return prefetched;
+		}
+	}
+
+	/*
+	 * We didn't find the block in the LRU, so store it in the slot with the
+	 * "oldest" prefetch request in this LRU (which might be an empty entry).
+	 */
+	Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+	entry = PREFETCH_LRU_ENTRY(prefetch, lru, oldestIndex);
+
+	entry->block = block;
+	entry->request = ++prefetch->prefetchRequest;
+
+	/* not in the prefetch cache */
+	return false;
+}
+
+/*
+ * IndexPrefetchHeapPage
+ *		Prefetch a heap page for the TID, unless it's sequential or was
+ *		recently prefetched.
+ */
+static void
+IndexPrefetchHeapPage(IndexScanDesc scan, IndexPrefetch *prefetch, IndexPrefetchEntry *entry)
+{
+	BlockNumber block = ItemPointerGetBlockNumber(&entry->tid);
+
+	prefetch->countAll++;
+
+	/*
+	 * Do not prefetch the same block over and over again, if it's probably
+	 * still in memory (page cache).
+	 *
+	 * This happens e.g. for clustered or naturally correlated indexes (fkey
+	 * to a sequence ID). It's not expensive (the block is in page cache
+	 * already, so no I/O), but it's not free either.
+	 *
+	 * If we make a mistake and prefetch a buffer that's still in our shared
+	 * buffers, PrefetchBuffer will take care of that. If it's in page cache,
+	 * we'll issue an unnecessary prefetch. There's not much we can do about
+	 * that, unfortunately.
+	 *
+	 * XXX Maybe we could check PrefetchBufferResult and adjust countPrefetch
+	 * based on that?
+	 */
+	if (IndexPrefetchIsCached(prefetch, block))
+		return;
+
+	prefetch->countPrefetch++;
+
+	PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+	pgBufferUsage.blks_prefetches++;
+}
+
+/*
+ * IndexPrefetchFillQueue
+ *		Fill the prefetch queue and issue necessary prefetch requests.
+ *
+ * If the prefetching is still active (enabled, not reached end of scan), read
+ * TIDs into the queue until we hit the current target.
+ *
+ * This also ramps-up the prefetch target from 0 to prefetch_max, determined
+ * when allocating the prefetcher.
+ */
+static void
+IndexPrefetchFillQueue(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	/* When inactive (not enabled or end of scan reached), we're done. */
+	if (!PREFETCH_ACTIVE(prefetch))
+		return;
+
+	/*
+	 * Ramp up the prefetch distance incrementally.
+	 *
+	 * Intentionally done first, before reading the TIDs into the queue, so
+	 * that there's always at least one item. Otherwise we might get into a
+	 * situation where we start with target=0 and no TIDs loaded.
+	 */
+	prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+								   prefetch->prefetchMaxTarget);
+
+	/*
+	 * Read TIDs from the index until the queue is full (with respect to the
+	 * current prefetch target).
+	 */
+	while (!PREFETCH_QUEUE_FULL(prefetch))
+	{
+		IndexPrefetchEntry *entry
+		= prefetch->next_cb(scan, direction, prefetch->data);
+
+		/* no more entries in this index scan */
+		if (entry == NULL)
+		{
+			prefetch->prefetchDone = true;
+			return;
+		}
+
+		Assert(ItemPointerEquals(&entry->tid, &scan->xs_heaptid));
+
+		/* store the entry and then maybe issue the prefetch request */
+		prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd++)] = *entry;
+
+		/* issue the prefetch request? */
+		if (entry->prefetch)
+			IndexPrefetchHeapPage(scan, prefetch, entry);
+	}
+}
+
+/*
+ * IndexPrefetchNextEntry
+ *		Get the next entry from the prefetch queue (or from the index directly).
+ *
+ * If prefetching is enabled, get next entry from the prefetch queue (unless
+ * queue is empty). With prefetching disabled, read an entry directly from the
+ * index scan.
+ *
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this? Although, that
+ * should be in the indexscan next_cb callback, probably.
+ *
+ * XXX If xs_heap_continue=true, we need to return the last TID.
+ */
+static IndexPrefetchEntry *
+IndexPrefetchNextEntry(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	IndexPrefetchEntry *entry = NULL;
+
+	/*
+	 * With prefetching enabled (even if we already finished reading all TIDs
+	 * from the index scan), we need to return a TID from the queue.
+	 * Otherwise, we just get the next TID from the scan directly.
+	 */
+	if (PREFETCH_ENABLED(prefetch))
+	{
+		/* Did we reach the end of the scan and the queue is empty? */
+		if (PREFETCH_DONE(prefetch))
+			return NULL;
+
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
+		entry->data = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].data;
+
+		prefetch->queueIndex++;
+
+		scan->xs_heaptid = entry->tid;
+	}
+	else						/* not prefetching, just do the regular work  */
+	{
+		ItemPointer tid;
+
+		/* Time to fetch the next TID from the index */
+		tid = index_getnext_tid(scan, direction);
+
+		/* If we're out of index entries, we're done */
+		if (tid == NULL)
+			return NULL;
+
+		Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = scan->xs_heaptid;
+		entry->data = NULL;
+	}
+
+	return entry;
+}
+
+/*
+ * IndexPrefetchComputeTarget
+ *		Calculate prefetch distance for the given heap relation.
+ *
+ * We disable prefetching when using direct I/O (when there's no page cache
+ * to prefetch into), and scans where the prefetch distance may change (e.g.
+ * for scrollable cursors).
+ *
+ * In regular cases we look at effective_io_concurrency for the tablespace
+ * (of the heap, not the index), and cap it with plan_rows.
+ *
+ * XXX We cap the target to plan_rows, because it's pointless to prefetch
+ * more than we expect to use.
+ *
+ * XXX Maybe we should reduce the value with parallel workers?
+ */
+int
+IndexPrefetchComputeTarget(Relation heapRel, double plan_rows, bool prefetch)
+{
+	/*
+	 * No prefetching for direct I/O.
+	 *
+	 * XXX Shouldn't we do prefetching even for direct I/O? We would only
+	 * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
+	 * once the code starts loading into shared buffers, that'd work.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+		return 0;
+
+	/* disable prefetching (for cursors etc.) */
+	if (!prefetch)
+		return 0;
+
+	/* regular case, look at tablespace effective_io_concurrency */
+	return Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+			   plan_rows);
+}
+
+/*
+ * IndexPrefetchAlloc
+ *		Allocate the index prefetcher.
+ *
+ * The behavior is customized by two callbacks - next_cb, which generates TID
+ * values to put into the prefetch queue, and (optional) cleanup_cb which
+ * releases resources at the end.
+ *
+ * prefetch_max specifies the maximum prefetch distance, i.e. how many TIDs
+ * ahead to keep in the prefetch queue. prefetch_max=0 means prefetching is
+ * disabled.
+ *
+ * data may point to custom data associated with the prefetcher.
+ */
+IndexPrefetch *
+IndexPrefetchAlloc(IndexPrefetchNextCB next_cb, IndexPrefetchCleanupCB cleanup_cb,
+				   int prefetch_max, void *data)
+{
+	IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+	/* the next_cb callback is required */
+	Assert(next_cb);
+
+	/* valid prefetch distance */
+	Assert((prefetch_max >= 0) && (prefetch_max <= MAX_IO_CONCURRENCY));
+
+	prefetch->queueIndex = 0;
+	prefetch->queueStart = 0;
+	prefetch->queueEnd = 0;
+
+	prefetch->prefetchTarget = 0;
+	prefetch->prefetchMaxTarget = prefetch_max;
+
+	/*
+	 * Customize the prefetch to also check visibility map and keep the result
+	 * so that IOS does not need to repeat it.
+	 */
+	prefetch->next_cb = next_cb;
+	prefetch->cleanup_cb = cleanup_cb;
+	prefetch->data = data;
+
+	return prefetch;
+}
+
+/*
+ * IndexPrefetchNext
+ *		Read the next entry from the prefetch queue.
+ *
+ * Returns the next TID in the prefetch queue (which might have been prefetched
+ * sometime in the past). If needed, it adds more entries to the queue and does
+ * the prefetching for them.
+ *
+ * Returns IndexPrefetchEntry with the TID and optional data associated with
+ * the TID in the next_cb callback.
+ */
+IndexPrefetchEntry *
+IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	/* Do prefetching (if requested/enabled). */
+	IndexPrefetchFillQueue(scan, prefetch, direction);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return IndexPrefetchNextEntry(scan, prefetch, direction);
+}
+
+/*
+ * IndexPrefetchReset
+ *		Reset the prefetch TID queue, restart the prefetching.
+ *
+ * Useful during rescans etc. This also resets the prefetch target, so that
+ * each rescan does the initial prefetch ramp-up from target=0 to maximum
+ * prefetch distance.
+ */
+void
+IndexPrefetchReset(IndexScanDesc scan, IndexPrefetch *state)
+{
+	if (!state)
+		return;
+
+	state->queueIndex = 0;
+	state->queueStart = 0;
+	state->queueEnd = 0;
+
+	state->prefetchDone = false;
+	state->prefetchTarget = 0;
+}
+
+/*
+ * IndexPrefetchStats
+ *		Log basic runtime debug stats of the prefetcher.
+ *
+ * FIXME Should be only in debug builds, or something like that.
+ */
+void
+IndexPrefetchStats(IndexScanDesc scan, IndexPrefetch *state)
+{
+	if (!state)
+		return;
+
+	elog(LOG, "index prefetch stats: requests %u prefetches %u (%f) skip cached %u sequential %u",
+		 state->countAll,
+		 state->countPrefetch,
+		 state->countPrefetch * 100.0 / state->countAll,
+		 state->countSkipCached,
+		 state->countSkipSequential);
+}
+
+/*
+ * IndexPrefetchEnd
+ *		Release resources associated with the prefetcher.
+ *
+ * This is primarily about the private data the caller might have allocated
+ * in the next_cb, and stored in the data field. We don't know what the
+ * data might contain (e.g. buffers etc.), requiring additional cleanup, so
+ * we call another custom callback.
+ *
+ * Needs to be called at the end of the executor node.
+ *
+ * XXX Maybe if there's no callback, we should just pfree the data? Does
+ * not seem very useful, though.
+ */
+void
+IndexPrefetchEnd(IndexScanDesc scan, IndexPrefetch *state)
+{
+	if (!state)
+		return;
+
+	if (!state->cleanup_cb)
+		return;
+
+	state->cleanup_cb(scan, state->data);
+}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 268ae8a945f..8fda8694350 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->local_blks_written += add->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+	dst->blks_prefetches += add->blks_prefetches;
 	INSTR_TIME_ADD(dst->shared_blk_read_time, add->shared_blk_read_time);
 	INSTR_TIME_ADD(dst->shared_blk_write_time, add->shared_blk_write_time);
 	INSTR_TIME_ADD(dst->local_blk_read_time, add->local_blk_read_time);
@@ -259,6 +261,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+	dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
 						  add->shared_blk_read_time, sub->shared_blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2c2c9c10b57..fce10ea6518 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -36,6 +36,7 @@
 #include "access/tupdesc.h"
 #include "access/visibilitymap.h"
 #include "executor/execdebug.h"
+#include "executor/executor.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
@@ -44,11 +45,14 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
-
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
 							TupleDesc itupdesc);
-
+static IndexPrefetchEntry *IndexOnlyPrefetchNext(IndexScanDesc scan,
+												 ScanDirection direction,
+												 void *data);
+static void IndexOnlyPrefetchCleanup(IndexScanDesc scan,
+									 void *data);
 
 /* ----------------------------------------------------------------
  *		IndexOnlyNext
@@ -65,6 +69,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	ItemPointer tid;
+	IndexPrefetch *prefetch;
+	IndexPrefetchEntry *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -78,11 +84,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->ioss_ScanDesc;
+	prefetch = node->ioss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
 	if (scandesc == NULL)
 	{
+		int	prefetch_max;
+
 		/*
 		 * We reach here if the index only scan is not parallel, or if we're
 		 * serially executing an index only scan that was planned to be
@@ -111,15 +120,39 @@ IndexOnlyNext(IndexOnlyScanState *node)
 						 node->ioss_NumScanKeys,
 						 node->ioss_OrderByKeys,
 						 node->ioss_NumOrderByKeys);
+
+		/*
+		 * Also initialize index prefetcher. We do this even when prefetching is
+		 * not done (see IndexPrefetchComputeTarget), because the prefetcher is
+		 * used for all index reads.
+		 *
+		 * XXX Maybe we should reduce the target in case this is a parallel index
+		 * scan. We don't want to issue a multiple of effective_io_concurrency.
+		 *
+		 * XXX Maybe rename the object to "index reader" or something?
+		 */
+		prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
+												  node->ss.ps.plan->plan_rows,
+												  estate->es_use_prefetching);
+
+		node->ioss_prefetch = IndexPrefetchAlloc(IndexOnlyPrefetchNext,
+												 IndexOnlyPrefetchCleanup,
+												 prefetch_max,
+												 palloc0(sizeof(Buffer)));
 	}
 
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
+		bool	   *all_visible = NULL;
 		bool		tuple_from_heap = false;
 
+		/* unpack the entry */
+		tid = &entry->tid;
+		all_visible = (bool *) entry->data; /* result of visibility check */
+
 		CHECK_FOR_INTERRUPTS();
 
 		/*
@@ -155,8 +188,12 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 *
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
+		 *
+		 * We can skip this check when the prefetcher already determined the
+		 * page is all-visible.
 		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+		if (!(all_visible && *all_visible) &&
+			!VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
 		{
@@ -353,6 +390,9 @@ ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 
+	/* also reset the prefetcher, so that we start from scratch */
+	IndexPrefetchReset(node->ioss_ScanDesc, node->ioss_prefetch);
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -380,6 +420,12 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		node->ioss_VMBuffer = InvalidBuffer;
 	}
 
+	/* XXX Print some debug stats. Should be removed. */
+	IndexPrefetchStats(indexScanDesc, node->ioss_prefetch);
+
+	/* Release VM buffer pin from prefetcher, if any. */
+	IndexPrefetchEnd(indexScanDesc, node->ioss_prefetch);
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -715,3 +761,62 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 }
+
+/*
+ * When prefetching for an index-only scan, we only want to prefetch heap
+ * pages that are not marked all-visible - skipping the heap access for
+ * all-visible pages is the whole point of IOS.
+ *
+ * The buffer used by the VM_ALL_VISIBLE() check is reused, similarly to
+ * ioss_VMBuffer (maybe we could/should use it here too?). We also store
+ * the result of the visibility check in the entry, so that the main loop
+ * does not have to repeat it.
+ */
+static IndexPrefetchEntry *
+IndexOnlyPrefetchNext(IndexScanDesc scan, ScanDirection direction, void *data)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer tid;
+
+	Assert(data);
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+
+		bool		all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+												 blkno,
+												 (Buffer *) data);
+
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch only if not all visible */
+		entry->prefetch = !all_visible;
+
+		/* store the all_visible flag in the private part of the entry */
+		entry->data = palloc(sizeof(bool));
+		*(bool *) entry->data = all_visible;
+	}
+
+	return entry;
+}
+
+/*
+ * For IOS, we may have a VM buffer in the private data, so make sure to
+ * release it properly.
+ */
+static void
+IndexOnlyPrefetchCleanup(IndexScanDesc scan, void *data)
+{
+	Buffer	   *buffer = (Buffer *) data;
+
+	Assert(data);
+
+	if (*buffer != InvalidBuffer)
+	{
+		ReleaseBuffer(*buffer);
+		*buffer = InvalidBuffer;
+	}
+}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 03142b4a946..0548403dc50 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -34,6 +34,7 @@
 #include "access/tableam.h"
 #include "catalog/pg_am.h"
 #include "executor/execdebug.h"
+#include "executor/executor.h"
 #include "executor/nodeIndexscan.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
@@ -69,6 +70,9 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
 							  Datum *orderbyvals, bool *orderbynulls);
 static HeapTuple reorderqueue_pop(IndexScanState *node);
 
+static IndexPrefetchEntry *IndexScanPrefetchNext(IndexScanDesc scan,
+												 ScanDirection direction,
+												 void *data);
 
 /* ----------------------------------------------------------------
  *		IndexNext
@@ -85,6 +89,8 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	IndexPrefetch *prefetch;
+	IndexPrefetchEntry *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -98,11 +104,14 @@ IndexNext(IndexScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->iss_ScanDesc;
+	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
 	if (scandesc == NULL)
 	{
+		int prefetch_max;
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
@@ -123,15 +132,43 @@ IndexNext(IndexScanState *node)
 			index_rescan(scandesc,
 						 node->iss_ScanKeys, node->iss_NumScanKeys,
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		/*
+		 * Also initialize the index prefetcher. We do this even when prefetching
+		 * is disabled (see IndexPrefetchComputeTarget), because the prefetcher is
+		 * used for all index reads.
+		 *
+		 * XXX Maybe reduce the target for parallel index scans, so the workers
+		 * don't collectively issue a multiple of effective_io_concurrency prefetches.
+		 *
+		 * XXX Maybe rename the object to "index reader" or something?
+		 */
+		prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
+												  node->ss.ps.plan->plan_rows,
+												  estate->es_use_prefetching);
+
+		node->iss_prefetch = IndexPrefetchAlloc(IndexScanPrefetchNext,
+												NULL, /* no extra cleanup */
+												prefetch_max,
+												NULL);
 	}
 
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scandesc->xs_heaptid));
+		if (!index_fetch_heap(scandesc, slot))
+			continue;
+
 		/*
 		 * If the index was lossy, we have to recheck the index quals using
 		 * the fetched tuple.
@@ -588,6 +625,9 @@ ExecReScanIndexScan(IndexScanState *node)
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 	node->iss_ReachedEnd = false;
 
+	/* also reset the prefetcher, so that we start from scratch */
+	IndexPrefetchReset(node->iss_ScanDesc, node->iss_prefetch);
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -794,6 +834,9 @@ ExecEndIndexScan(IndexScanState *node)
 	indexRelationDesc = node->iss_RelationDesc;
 	indexScanDesc = node->iss_ScanDesc;
 
+	/* XXX Print some debug stats. Should be removed. */
+	IndexPrefetchStats(indexScanDesc, node->iss_prefetch);
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -1728,3 +1771,26 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 					 node->iss_ScanKeys, node->iss_NumScanKeys,
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 }
+
+/*
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this?
+ */
+static IndexPrefetchEntry *
+IndexScanPrefetchNext(IndexScanDesc scan, ScanDirection direction, void *data)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer tid;
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch always */
+		entry->prefetch = true;
+	}
+
+	return entry;
+}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5e8c335a737..e792c3fc8d8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -677,4 +677,56 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execPrefetch.c
+ */
+
+typedef struct IndexPrefetchEntry
+{
+	ItemPointerData tid;
+
+	/* should we prefetch heap page for this TID? */
+	bool		prefetch;
+
+	/*
+	 * If a callback is specified, it may store per-TID information here. The
+	 * data has to be a single palloc-ed chunk, so that it can be freed with a
+	 * plain pfree.
+	 *
+	 * XXX We could relax this by providing another cleanup callback, but that
+	 * seems unnecessarily complex - we expect the information to be very
+	 * simple (bool flags or similar), easy to keep in a small struct and
+	 * perhaps even to reuse without repeated pfree/palloc.
+	 */
+	void	   *data;
+} IndexPrefetchEntry;
+
+/*
+ * Custom callback, allowing the caller to determine the next TID to read.
+ *
+ * If there is no TID to return, the callback is expected to return NULL.
+ *
+ * Otherwise the "tid" field of the returned entry contains the TID to
+ * prefetch, and "data" may point to information the callback passes back.
+ */
+typedef IndexPrefetchEntry *(*IndexPrefetchNextCB) (IndexScanDesc scan,
+													ScanDirection direction,
+													void *data);
+
+typedef void (*IndexPrefetchCleanupCB) (IndexScanDesc scan,
+										void *data);
+
+extern IndexPrefetch *IndexPrefetchAlloc(IndexPrefetchNextCB next_cb,
+										 IndexPrefetchCleanupCB cleanup_cb,
+										 int prefetch_max, void *data);
+
+extern IndexPrefetchEntry *IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *state,
+											 ScanDirection direction);
+
+extern void IndexPrefetchReset(IndexScanDesc scan, IndexPrefetch *state);
+extern void IndexPrefetchStats(IndexScanDesc scan, IndexPrefetch *state);
+extern void IndexPrefetchEnd(IndexScanDesc scan, IndexPrefetch *state);
+
+extern int	IndexPrefetchComputeTarget(Relation heapRel, double plan_rows, bool prefetch);
+
 #endif							/* EXECUTOR_H  */
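
[Reviewer note, not part of the patch: to summarize the API declared above,
here is a minimal sketch of the call sequence an executor node is expected
to follow. It condenses what the nodeIndexscan.c changes earlier in this
patch already do; "MyPrefetchNext" is a stand-in for the node's own "next"
callback, the surrounding node state (heapRelation, plan_rows, scandesc,
direction, estate) is assumed to exist, and error handling is omitted.]

	IndexPrefetch	   *prefetch;
	IndexPrefetchEntry *entry;
	int					prefetch_max;

	/* size the prefetch queue, honoring the executor-level flag */
	prefetch_max = IndexPrefetchComputeTarget(heapRelation,
											  plan_rows,
											  estate->es_use_prefetching);

	/* one prefetcher per scan, with the node-specific "next" callback */
	prefetch = IndexPrefetchAlloc(MyPrefetchNext,
								  NULL,			/* no cleanup callback */
								  prefetch_max,
								  NULL);		/* no private data */

	/* per-tuple loop, replacing the index_getnext_* call */
	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
	{
		/*
		 * entry->tid is the TID to process; entry->data is whatever the
		 * callback stored (e.g. the all-visible flag for index-only scans).
		 */
	}

	/* on rescan and at the end of the scan, respectively */
	IndexPrefetchReset(scandesc, prefetch);
	IndexPrefetchEnd(scandesc, prefetch);
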
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bfd7b6d8445..fadeb389495 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
+	int64		blks_prefetch_rounds;	/* # of prefetch rounds */
+	int64		blks_prefetches;	/* # of blocks prefetched */
 	instr_time	shared_blk_read_time;	/* time spent reading shared blocks */
 	instr_time	shared_blk_write_time;	/* time spent writing shared blocks */
 	instr_time	local_blk_read_time;	/* time spent reading local blocks */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 561fdd98f1b..141db5d4ae2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -690,6 +690,7 @@ typedef struct EState
 	struct EPQState *es_epq_active;
 
 	bool		es_use_parallel_mode;	/* can we use parallel workers? */
+	bool		es_use_prefetching; /* can we use prefetching? */
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area *es_query_dsa;
@@ -1529,6 +1530,9 @@ typedef struct
 	bool	   *elem_nulls;		/* array of num_elems is-null flags */
 } IndexArrayKeyInfo;
 
+/* forward declaration, needed by executor.h and the scan state structs below */
+typedef struct IndexPrefetch IndexPrefetch;
+
 /* ----------------
  *	 IndexScanState information
  *
@@ -1580,6 +1584,9 @@ typedef struct IndexScanState
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
+
+	/* prefetching */
+	IndexPrefetch *iss_prefetch;
 } IndexScanState;
 
 /* ----------------
@@ -1618,6 +1625,9 @@ typedef struct IndexOnlyScanState
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
+
+	/* prefetching */
+	IndexPrefetch *ioss_prefetch;
 } IndexOnlyScanState;
 
 /* ----------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f582eb59e7d..9d194ec2715 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1183,6 +1183,9 @@ IndexOnlyScanState
 IndexOptInfo
 IndexOrderByDistance
 IndexPath
+IndexPrefetch
+IndexPrefetchCacheEntry
+IndexPrefetchEntry
 IndexRuntimeKeyInfo
 IndexScan
 IndexScanDesc
-- 
2.43.0
