On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
<[email protected]> wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <[email protected]> wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO; this is done by functions like
> > pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> > The (read|write|fsync|flush_range) indicates the operation, whereas
> > (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
> >
> > (see below for more details about this design choice - it might or might
> > not be right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> > (pgaio_io_on_completion_local())
> >
> > 4) Step 2) alone does *not* cause the IO to be submitted to the kernel, but
> > to be put on a per-backend list of pending IOs. The pending IOs can be
> > flushed explicitly with pgaio_submit_pending(), but will also be submitted
> > if the pending list grows too large, or if the current backend waits for
> > the IO.
> >
> > There are two main reasons not to submit the IO immediately:
> > - If adjacent, we can merge several IOs into one "kernel level" IO during
> > submission. Larger IOs are considerably more efficient.
> > - Several AIO APIs allow submitting a batch of IOs in one system call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
> > backend. When other backends may need to wait for an IO to finish,
> > pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
> > BufferDesc), which can be waited for using pgaio_io_wait_ref().
> >
> > 6) Process the results of the request. If a callback was registered in 3),
> > this isn't always necessary. The results of AIO can be accessed using
> > pgaio_io_result() which returns an integer where negative numbers are
> > -errno, and positive numbers are the [partial] success conditions
> > (e.g. potentially indicating a short read).
> >
> > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
> > another operation (pgaio_io_recycle())
> >
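To make the lifecycle above concrete, here is a rough pseudo-C sketch of
steps 1) through 7). The argument lists are elided/assumed -- the real
signatures live on the AIO branch and take more parameters (relation,
fork, block, buffer, callback state, etc.):

    /* 1) acquire an unused AIO */
    PgAioInProgress *io = pgaio_io_get();

    /* 2) stage a read; this only puts it on the backend-local pending list */
    pgaio_io_start_read_smgr(io, /* reln, fork, block, buffer, ... */);

    /* 3) optionally register a backend-local completion callback */
    pgaio_io_on_completion_local(io, /* callback, ... */);

    /* 4) submit pending IOs explicitly (this also happens when the pending
     * list grows too large or when the backend waits for an IO) */
    pgaio_submit_pending(true);

    /* 5) wait for an IO owned by this backend; other backends would instead
     * wait on a shared-memory reference via pgaio_io_wait_ref() */
    pgaio_io_wait(io);

    /* 6) negative result is -errno; positive may indicate e.g. a short read */
    int res = pgaio_io_result(io);

    /* 7) release the IO, or pgaio_io_recycle() it for another operation */
    pgaio_io_release(io);
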
> >
> > Most places that want to use AIO shouldn't themselves need to care about
> > managing the number of writes in flight, or the readahead distance. To help
> > with that there are two helper utilities, a "streaming read" and a
> > "streaming write".
> >
> > The "streaming read" helper uses a callback to determine which blocks to
> > prefetch - that allows to do readahead in a sequential fashion but
> > importantly
> > also allows to asynchronously "read ahead" non-sequential blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility
> > map to figure out which block needs to be read next. Similarly,
> > lazy_vacuum_heap() uses the tids in LVDeadTuples to figure out which
> > blocks are going to be needed. Here's the latter as an example:
> > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
> >
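For reference, here is the consumer-side skeleton of the "streaming read"
helper, following the signatures used in the patches attached below
(callback bodies elided):

    static PgStreamingReadNextStatus
    my_next_cb(uintptr_t pgsr_private, PgAioInProgress *aio,
               uintptr_t *read_private)
    {
        /* pick the next block, start it with e.g. ReadBufferAsync(), and
         * return PGSR_NEXT_IO, PGSR_NEXT_NO_IO, or PGSR_NEXT_END */
    }

    static void
    my_release_cb(uintptr_t pgsr_private, uintptr_t read_private)
    {
        /* release per-read resources, e.g. a buffer pin */
    }

    PgStreamingRead *pgsr = pg_streaming_read_alloc(iodepth,
                                                    (uintptr_t) my_state,
                                                    my_next_cb,
                                                    my_release_cb);

    /* consume completed reads in submission order; returns 0 at stream end */
    uintptr_t read_private = pg_streaming_read_get_next(pgsr);
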
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
>
...
>
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
>
I have done some testing on how effective it is now.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.

I've also included a second patch which adds IO wait time information to
EXPLAIN output when used like:

EXPLAIN (buffers, analyze) SELECT ...

The same commit also introduces a temporary dev GUC,
io_bitmap_prefetch_depth, which I am using to experiment with the
prefetch window size.

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working.

The short version of my results is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below:
On Andres' AIO branch [1], with my bitmap heapscan prefetching patch set
applied, built with the following build flags:

-O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:

io_data_direct=1
io_data_force_async=off
io_method=io_uring
log_min_duration_statement=0
log_duration=on

set track_io_timing to on;
set max_parallel_workers_per_gather to 0;
set enable_seqscan to off;
set enable_indexscan to off;
set enable_bitmapscan to on;
set effective_io_concurrency to 128;
set io_bitmap_prefetch_depth to 128;

Using this DDL:
drop table if exists bar;
create table bar(a int, b text, c text, d int);
create index bar_idx on bar(a);
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,1000)i;
insert into bar select i%3, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,1000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,200)i;
insert into bar select i%100, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%2000, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%10, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,10000000)i;
insert into bar select i%100, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,2000)i;
insert into bar select i%10, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,2000)i;
analyze;
And this query:

select * from bar where a > 100 offset 10000000000000;

With the prefetch window set to 1, the query execution time is 5496.129 ms
and the IO wait time is 1550.915 ms:
mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1315.845 wait=1550.915
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1315.845 wait=1550.915
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1315.845
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.399
 Planning Time: 4.378 ms
 Execution Time: 5473.404 ms
(18 rows)
With the prefetch window set to 128, the query execution time is 3222 ms
and the IO wait time is 0.178 ms:
mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1339.795 wait=0.178
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1339.795 wait=0.178
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1339.795
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.488
 Planning Time: 4.279 ms
 Execution Time: 3434.522 ms
(18 rows)
- Melanie
[1] https://github.com/anarazel/postgres/tree/aio
From 752fd6cc636eea08c06b25d8898e091835442387 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Thu, 5 Aug 2021 15:47:50 -0400
Subject: [PATCH v2 2/2] Add IO wait time stat and add guc for BHS prefetch
- Add an IO wait time measurement which can be seen in EXPLAIN output
(with explain buffers option and track_io_timing on)
- TODO: add the wait time to database statistics
- Also, add a GUC to control the BHS pre-fetch max window size
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/commands/explain.c | 9 ++++++++-
src/backend/executor/instrument.c | 4 ++++
src/backend/postmaster/pgstat.c | 6 ++++++
src/backend/storage/aio/aio.c | 16 ++++++++++++++++
src/backend/storage/aio/aio_util.c | 2 --
src/backend/storage/buffer/bufmgr.c | 2 ++
src/backend/utils/adt/pgstatfuncs.c | 2 ++
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/executor/instrument.h | 1 +
src/include/pgstat.h | 6 ++++++
src/include/storage/bufmgr.h | 1 +
12 files changed, 57 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8bd8050dc..7409dd2fb3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2221,7 +2221,7 @@ void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate)
HeapScanDesc hscan = (HeapScanDesc ) scanstate->ss.ss_currentScanDesc;
if (!hscan->rs_inited)
{
- int iodepth = Max(Min(128, NBuffers / 128), 1);
+ int iodepth = io_bitmap_prefetch_depth;
hscan->pgsr = pg_streaming_read_alloc(iodepth, (uintptr_t) scanstate,
bitmapheapscan_pgsr_next_single,
bitmapheapscan_pgsr_release);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c66d39a5c7..d1e89f1e1b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3339,7 +3339,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
bool has_temp = (usage->temp_blks_read > 0 ||
usage->temp_blks_written > 0);
bool has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
- !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+ !INSTR_TIME_IS_ZERO(usage->blk_write_time) ||
+ !INSTR_TIME_IS_ZERO(usage->io_wait_time));
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing));
@@ -3416,6 +3417,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
appendStringInfo(es->str, " write=%0.3f",
INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+ if (!INSTR_TIME_IS_ZERO(usage->io_wait_time))
+ appendStringInfo(es->str, " wait=%0.3f",
+ INSTR_TIME_GET_MILLISEC(usage->io_wait_time));
appendStringInfoChar(es->str, '\n');
}
@@ -3452,6 +3456,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
ExplainPropertyFloat("I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
3, es);
+ ExplainPropertyFloat("I/O Wait Time", "ms",
+ INSTR_TIME_GET_MILLISEC(usage->io_wait_time),
+ 5, es);
}
}
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 96af2a2673..85e7321d63 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -218,6 +218,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->temp_blks_written += add->temp_blks_written;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+ INSTR_TIME_ADD(dst->io_wait_time, add->io_wait_time);
+
}
/* dst += add - sub */
@@ -240,6 +242,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
add->blk_write_time, sub->blk_write_time);
+ INSTR_TIME_ACCUM_DIFF(dst->io_wait_time,
+ add->io_wait_time, sub->io_wait_time);
}
/* helper functions for WAL usage accumulation */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9d335b8507..028ca14aa8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -258,6 +258,7 @@ static int pgStatXactCommit = 0;
static int pgStatXactRollback = 0;
PgStat_Counter pgStatBlockReadTime = 0;
PgStat_Counter pgStatBlockWriteTime = 0;
+PgStat_Counter pgStatIOWaitTime = 0;
static PgStat_Counter pgStatActiveTime = 0;
static PgStat_Counter pgStatTransactionIdleTime = 0;
SessionEndType pgStatSessionEndCause = DISCONNECT_NORMAL;
@@ -1004,10 +1005,12 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
tsmsg->m_xact_rollback = pgStatXactRollback;
tsmsg->m_block_read_time = pgStatBlockReadTime;
tsmsg->m_block_write_time = pgStatBlockWriteTime;
+ tsmsg->m_io_wait_time = pgStatIOWaitTime;
pgStatXactCommit = 0;
pgStatXactRollback = 0;
pgStatBlockReadTime = 0;
pgStatBlockWriteTime = 0;
+ pgStatIOWaitTime = 0;
}
else
{
@@ -1015,6 +1018,7 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
tsmsg->m_xact_rollback = 0;
tsmsg->m_block_read_time = 0;
tsmsg->m_block_write_time = 0;
+ tsmsg->m_io_wait_time = 0;
}
n = tsmsg->m_nentries;
@@ -5122,6 +5126,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
dbentry->last_checksum_failure = 0;
dbentry->n_block_read_time = 0;
dbentry->n_block_write_time = 0;
+ dbentry->n_io_wait_time = 0;
dbentry->n_sessions = 0;
dbentry->total_session_time = 0;
dbentry->total_active_time = 0;
@@ -6442,6 +6447,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
dbentry->n_block_read_time += msg->m_block_read_time;
dbentry->n_block_write_time += msg->m_block_write_time;
+ dbentry->n_io_wait_time += msg->m_io_wait_time;
/*
* Process all table entries in the message.
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 726d50f698..c5aacbcf9d 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -36,6 +36,8 @@
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
+#include "executor/instrument.h"
+#include "storage/bufmgr.h"
#define PGAIO_VERBOSE
@@ -1729,6 +1731,12 @@ wait_ref_again:
}
else if (io_method != IOMETHOD_WORKER && (flags & PGAIOIP_INFLIGHT))
{
+ instr_time io_wait_start, io_wait_time;
+ if (track_io_timing)
+ INSTR_TIME_SET_CURRENT(io_wait_start);
+ else
+ INSTR_TIME_SET_ZERO(io_wait_start);
+
/* note that this is allowed to spuriously return */
if (io_method == IOMETHOD_WORKER)
ConditionVariableSleep(&io->cv, WAIT_EVENT_AIO_IO_COMPLETE_ONE);
@@ -1741,6 +1749,14 @@ wait_ref_again:
else if (io_method == IOMETHOD_POSIX)
pgaio_posix_wait_one(io, ref_generation);
#endif
+
+ if (track_io_timing)
+ {
+ INSTR_TIME_SET_CURRENT(io_wait_time);
+ INSTR_TIME_SUBTRACT(io_wait_time, io_wait_start);
+ pgstat_count_io_wait_time(INSTR_TIME_GET_MICROSEC(io_wait_time));
+ INSTR_TIME_ADD(pgBufferUsage.io_wait_time, io_wait_time);
+ }
}
#ifdef USE_POSIX_AIO
/* XXX untangle this */
diff --git a/src/backend/storage/aio/aio_util.c b/src/backend/storage/aio/aio_util.c
index 35436dfcd3..a79db4d747 100644
--- a/src/backend/storage/aio/aio_util.c
+++ b/src/backend/storage/aio/aio_util.c
@@ -417,8 +417,6 @@ pg_streaming_read_alloc(uint32 iodepth, uintptr_t pgsr_private,
{
PgStreamingRead *pgsr;
- iodepth = Max(Min(iodepth, NBuffers / 128), 1);
-
pgsr = palloc0(offsetof(PgStreamingRead, all_items) +
sizeof(PgStreamingReadItem) * iodepth * 2);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5df1b78f2..69175699b0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -145,6 +145,8 @@ bool track_io_timing = false;
*/
int effective_io_concurrency = 0;
+int io_bitmap_prefetch_depth = 128;
+
/*
* Like effective_io_concurrency, but used by maintenance code paths that might
* benefit from a higher setting because they work on behalf of many sessions.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2fca05f7af..a6d4f121e1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1615,6 +1615,8 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
PG_RETURN_FLOAT8(result);
}
+// TODO: add one for io wait time
+
Datum
pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d143783f22..f4f02fb9ad 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3029,6 +3029,16 @@ static struct config_int ConfigureNamesInt[] =
0, MAX_IO_CONCURRENCY,
check_maintenance_io_concurrency, NULL, NULL
},
+ {
+ {"io_bitmap_prefetch_depth", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Maximum pre-fetch distance for bitmapheapscan"),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &io_bitmap_prefetch_depth,
+ 128, 0, MAX_IO_CONCURRENCY,
+ NULL, NULL, NULL
+ },
{
{"io_wal_concurrency", PGC_POSTMASTER, RESOURCES_ASYNCHRONOUS,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 62ca398e9d..950414cc0b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -30,6 +30,7 @@ typedef struct BufferUsage
long temp_blks_written; /* # of temp blocks written */
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
+ instr_time io_wait_time;
} BufferUsage;
typedef struct WalUsage
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b77dcbc58b..a02813f02f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -291,6 +291,7 @@ typedef struct PgStat_MsgTabstat
int m_xact_rollback;
PgStat_Counter m_block_read_time; /* times in microseconds */
PgStat_Counter m_block_write_time;
+ PgStat_Counter m_io_wait_time;
PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
} PgStat_MsgTabstat;
@@ -732,6 +733,7 @@ typedef struct PgStat_StatDBEntry
TimestampTz last_checksum_failure;
PgStat_Counter n_block_read_time; /* times in microseconds */
PgStat_Counter n_block_write_time;
+ PgStat_Counter n_io_wait_time;
PgStat_Counter n_sessions;
PgStat_Counter total_session_time;
PgStat_Counter total_active_time;
@@ -1417,6 +1419,8 @@ extern PgStat_MsgWal WalStats;
extern PgStat_Counter pgStatBlockReadTime;
extern PgStat_Counter pgStatBlockWriteTime;
+extern PgStat_Counter pgStatIOWaitTime;
+
/*
* Updated by the traffic cop and in errfinish()
*/
@@ -1593,6 +1597,8 @@ pgstat_report_wait_end(void)
(pgStatBlockReadTime += (n))
#define pgstat_count_buffer_write_time(n) \
(pgStatBlockWriteTime += (n))
+#define pgstat_count_io_wait_time(n) \
+ (pgStatIOWaitTime += (n))
extern void pgstat_count_heap_insert(Relation rel, PgStat_Counter n);
extern void pgstat_count_heap_update(Relation rel, bool hot);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07401f8493..8c21bb6e56 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -70,6 +70,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int effective_io_concurrency;
+extern int io_bitmap_prefetch_depth;
extern int maintenance_io_concurrency;
extern int checkpoint_flush_after;
--
2.27.0
From 4ef0eaf4aed17becd94d085c8092b8b79d7bca93 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 22 Jun 2021 16:14:58 -0400
Subject: [PATCH v2 1/2] Use pgsr for AIO bitmapheapscan
---
src/backend/access/gin/ginget.c | 18 +-
src/backend/access/gin/ginscan.c | 4 +
src/backend/access/heap/heapam_handler.c | 190 +++++++-
src/backend/executor/nodeBitmapHeapscan.c | 505 ++--------------------
src/backend/nodes/tidbitmap.c | 55 ++-
src/include/access/gin_private.h | 5 +
src/include/access/heapam.h | 2 +
src/include/access/tableam.h | 4 +-
src/include/executor/nodeBitmapHeapscan.h | 1 +
src/include/nodes/execnodes.h | 24 +-
src/include/nodes/tidbitmap.h | 7 +-
src/include/storage/aio.h | 2 +-
12 files changed, 279 insertions(+), 538 deletions(-)
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 03191e016c..ef7c284cd0 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -311,6 +311,8 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
}
}
+#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
+
/*
* Start* functions setup beginning state of searches: finds correct buffer and pins it.
*/
@@ -332,6 +334,7 @@ restartScanEntry:
entry->nlist = 0;
entry->matchBitmap = NULL;
entry->matchResult = NULL;
+ entry->savedMatchResult = NULL;
entry->reduceResult = false;
entry->predictNumberResult = 0;
@@ -372,7 +375,10 @@ restartScanEntry:
if (entry->matchBitmap)
{
if (entry->matchIterator)
+ {
tbm_end_iterate(entry->matchIterator);
+ pfree(entry->savedMatchResult);
+ }
entry->matchIterator = NULL;
tbm_free(entry->matchBitmap);
entry->matchBitmap = NULL;
@@ -385,6 +391,8 @@ restartScanEntry:
if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
{
entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+ entry->savedMatchResult = (TBMIterateResult *) palloc0(sizeof(TBMIterateResult) +
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
entry->isFinished = false;
}
}
@@ -790,6 +798,7 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry,
#define gin_rand() (((double) random()) / ((double) MAX_RANDOM_VALUE))
#define dropItem(e) ( gin_rand() > ((double)GinFuzzySearchLimit)/((double)((e)->predictNumberResult)) )
+
/*
* Sets entry->curItem to next heap item pointer > advancePast, for one entry
* of one scan key, or sets entry->isFinished to true if there are no more.
@@ -817,7 +826,6 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
/* A bitmap result */
BlockNumber advancePastBlk = GinItemPointerGetBlockNumber(&advancePast);
OffsetNumber advancePastOff = GinItemPointerGetOffsetNumber(&advancePast);
-
for (;;)
{
/*
@@ -831,12 +839,18 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
(ItemPointerIsLossyPage(&advancePast) &&
entry->matchResult->blockno == advancePastBlk))
{
- entry->matchResult = tbm_iterate(entry->matchIterator);
+
+ tbm_iterate(entry->matchIterator, entry->savedMatchResult);
+ if (!BlockNumberIsValid(entry->savedMatchResult->blockno))
+ entry->matchResult = NULL;
+ else
+ entry->matchResult = entry->savedMatchResult;
if (entry->matchResult == NULL)
{
ItemPointerSetInvalid(&entry->curItem);
tbm_end_iterate(entry->matchIterator);
+ pfree(entry->savedMatchResult);
entry->matchIterator = NULL;
entry->isFinished = true;
break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index 55e2d49fd7..3fd9310887 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -107,6 +107,7 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
scanEntry->matchBitmap = NULL;
scanEntry->matchIterator = NULL;
scanEntry->matchResult = NULL;
+ scanEntry->savedMatchResult = NULL;
scanEntry->list = NULL;
scanEntry->nlist = 0;
scanEntry->offset = InvalidOffsetNumber;
@@ -246,7 +247,10 @@ ginFreeScanKeys(GinScanOpaque so)
if (entry->list)
pfree(entry->list);
if (entry->matchIterator)
+ {
tbm_end_iterate(entry->matchIterator);
+ pfree(entry->savedMatchResult);
+ }
if (entry->matchBitmap)
tbm_free(entry->matchBitmap);
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9c65741c41..a8bd8050dc 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
#include "access/syncscan.h"
#include "access/tableam.h"
#include "access/tsmapi.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -36,6 +37,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/lmgr.h"
@@ -57,6 +59,11 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
static const TableAmRoutine heapam_methods;
+static PgStreamingReadNextStatus
+bitmapheapscan_pgsr_next_single(uintptr_t pgsr_private, PgAioInProgress *aio, uintptr_t *read_private);
+static void
+bitmapheapscan_pgsr_release(uintptr_t pgsr_private, uintptr_t read_private);
+
/* ------------------------------------------------------------------------
* Slot related callbacks for heap AM
@@ -2106,13 +2113,128 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
* Executor related callbacks for the heap AM
* ------------------------------------------------------------------------
*/
+#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
+
+// TODO: for heap, these are in heapam.c instead of heapam_handler.c
+// but, heap may move where it does the setup of pgsr
+static void
+bitmapheapscan_pgsr_release(uintptr_t pgsr_private, uintptr_t read_private)
+{
+ BitmapHeapScanState *bhs_state = (BitmapHeapScanState *) pgsr_private;
+ HeapScanDesc hdesc = (HeapScanDesc ) bhs_state->ss.ss_currentScanDesc;
+ TBMIterateResult *tbmres = (TBMIterateResult *) read_private;
+
+ ereport(DEBUG3,
+ errmsg("pgsr %s: releasing buf %d",
+ NameStr(hdesc->rs_base.rs_rd->rd_rel->relname),
+ tbmres->buffer),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ Assert(BufferIsValid(tbmres->buffer));
+ ReleaseBuffer(tbmres->buffer);
+}
+
+static PgStreamingReadNextStatus
+bitmapheapscan_pgsr_next_single(uintptr_t pgsr_private, PgAioInProgress *aio, uintptr_t *read_private)
+{
+ bool already_valid;
+ bool skip_fetch;
+ BitmapHeapScanState *bhs_state = (BitmapHeapScanState *) pgsr_private;
+ /*
+ * TODO: instead of passing the BitmapHeapScanState node when setting up
+ * and ultimately using it here as pgsr_private, perhaps I can pass only the
+ * iterator by adding a pointer to the HeapScanDesc to the iterator and
+ * moving the vmbuffer into the heapscandesc and also add can_skip_fetch to
+ * the iterator and then pass the iterator as the private state.
+ * If doing this, will need a separate bitmapheapscan_pgsr_next_parallel() in
+ * addition to the bitmapheapscan_pgsr_next_single() which would use the
+ * shared_tbmiterator instead of the tbmiterator() (and then would need separate
+ * alloc functions for setup and potentially different release functions).
+ */
+ ParallelBitmapHeapState *pstate = bhs_state->pstate;
+ HeapScanDesc hdesc = (HeapScanDesc ) bhs_state->ss.ss_currentScanDesc;
+ TBMIterateResult *tbmres = (TBMIterateResult *) palloc0(sizeof(TBMIterateResult) +
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ Assert(bhs_state->initialized);
+ if (pstate == NULL)
+ tbm_iterate(bhs_state->tbmiterator, tbmres);
+ else
+ tbm_shared_iterate(bhs_state->shared_tbmiterator, tbmres);
+
+ // TODO: could this be invalid for another reason than hit_end?
+ if (!BlockNumberIsValid(tbmres->blockno))
+ {
+ pfree(tbmres);
+ tbmres = NULL;
+ *read_private = 0;
+ return PGSR_NEXT_END;
+ }
+ /*
+ * Ignore any claimed entries past what we think is the end of the
+ * relation. It may have been extended after the start of our scan (we
+ * only hold an AccessShareLock, and it could be inserts from this
+ * backend).
+ */
+ if (tbmres->blockno >= hdesc->rs_nblocks)
+ {
+ tbmres->blockno = InvalidBlockNumber;
+ *read_private = (uintptr_t) tbmres;
+ return PGSR_NEXT_NO_IO;
+ }
+
+ /*
+ * We can skip fetching the heap page if we don't need any fields
+ * from the heap, and the bitmap entries don't need rechecking,
+ * and all tuples on the page are visible to our transaction.
+ */
+ skip_fetch = (bhs_state->can_skip_fetch && !tbmres->recheck &&
+ VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno,
+ &bhs_state->vmbuffer));
+
+ if (skip_fetch)
+ {
+ /*
+ * The number of tuples on this page is put into
+ * node->return_empty_tuples.
+ */
+ tbmres->buffer = InvalidBuffer;
+ *read_private = (uintptr_t) tbmres;
+ return PGSR_NEXT_NO_IO;
+ }
+ tbmres->buffer = ReadBufferAsync(hdesc->rs_base.rs_rd,
+ MAIN_FORKNUM,
+ tbmres->blockno,
+ RBM_NORMAL, hdesc->rs_strategy, &already_valid,
+ &aio);
+ *read_private = (uintptr_t) tbmres;
+
+ if (already_valid)
+ return PGSR_NEXT_NO_IO;
+ else
+ return PGSR_NEXT_IO;
+}
+
+// TODO: put this in the right place
+void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate)
+{
+ HeapScanDesc hscan = (HeapScanDesc ) scanstate->ss.ss_currentScanDesc;
+ if (!hscan->rs_inited)
+ {
+ int iodepth = Max(Min(128, NBuffers / 128), 1);
+ hscan->pgsr = pg_streaming_read_alloc(iodepth, (uintptr_t) scanstate,
+ bitmapheapscan_pgsr_next_single,
+ bitmapheapscan_pgsr_release);
+
+ hscan->rs_inited = true;
+ }
+}
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan,
- TBMIterateResult *tbmres)
+ TBMIterateResult **tbmres)
{
HeapScanDesc hscan = (HeapScanDesc) scan;
- BlockNumber page = tbmres->blockno;
Buffer buffer;
Snapshot snapshot;
int ntup;
@@ -2120,22 +2242,35 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;
- /*
- * Ignore any claimed entries past what we think is the end of the
- * relation. It may have been extended after the start of our scan (we
- * only hold an AccessShareLock, and it could be inserts from this
- * backend).
- */
- if (page >= hscan->rs_nblocks)
+ Assert(hscan->pgsr);
+ if (*tbmres)
+ {
+ if (BufferIsValid((*tbmres)->buffer))
+ ReleaseBuffer((*tbmres)->buffer);
+ hscan->rs_cbuf = InvalidBuffer;
+ pfree(*tbmres);
+ }
+
+ *tbmres = (TBMIterateResult *) pg_streaming_read_get_next(hscan->pgsr);
+ /* hit the end */
+ if (*tbmres == NULL)
+ return true;
+
+ /* Invalid due to past the end of the relation */
+ if (!BlockNumberIsValid((*tbmres)->blockno))
+ {
+ pfree(*tbmres);
+ *tbmres = NULL;
return false;
+ }
+
+ hscan->rs_cblock = (*tbmres)->blockno;
+ hscan->rs_cbuf = (*tbmres)->buffer;
+
+ /* Skipped fetching, we'll still use ntuples though */
+ if (!(BufferIsValid(hscan->rs_cbuf)))
+ return true;
- /*
- * Acquire pin on the target heap page, trading in any pin we held before.
- */
- hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
- scan->rs_rd,
- page);
- hscan->rs_cblock = page;
buffer = hscan->rs_cbuf;
snapshot = scan->rs_snapshot;
@@ -2156,7 +2291,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
/*
* We need two separate strategies for lossy and non-lossy cases.
*/
- if (tbmres->ntuples >= 0)
+ if ((*tbmres)->ntuples >= 0)
{
/*
* Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2165,13 +2300,13 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
*/
int curslot;
- for (curslot = 0; curslot < tbmres->ntuples; curslot++)
+ for (curslot = 0; curslot < (*tbmres)->ntuples; curslot++)
{
- OffsetNumber offnum = tbmres->offsets[curslot];
+ OffsetNumber offnum = (*tbmres)->offsets[curslot];
ItemPointerData tid;
HeapTupleData heapTuple;
- ItemPointerSet(&tid, page, offnum);
+ ItemPointerSet(&tid, (*tbmres)->blockno, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
&heapTuple, NULL, true))
hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2199,7 +2334,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
loctup.t_len = ItemIdGetLength(lp);
loctup.t_tableOid = scan->rs_rd->rd_id;
- ItemPointerSet(&loctup.t_self, page, offnum);
+ ItemPointerSet(&loctup.t_self, (*tbmres)->blockno, offnum);
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
if (valid)
{
@@ -2230,6 +2365,19 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
Page dp;
ItemId lp;
+ /* we skipped fetching */
+ if (BufferIsInvalid(tbmres->buffer))
+ {
+ Assert(tbmres->ntuples >= 0);
+ if (tbmres->ntuples > 0)
+ {
+ ExecStoreAllNullTuple(slot);
+ tbmres->ntuples--;
+ return true;
+ }
+ Assert(tbmres->ntuples == 0);
+ return false;
+ }
/*
* Out of range? If so, nothing more to look at on this page
*/
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3861bd8a24..3ef25a8f36 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -36,6 +36,8 @@
#include "postgres.h"
#include <math.h>
+// TODO: delete me after moving scan setup function
+#include "access/heapam.h"
#include "access/relscan.h"
#include "access/tableam.h"
@@ -55,13 +57,13 @@
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
- TableScanDesc scan);
static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
+// TODO: add to tableam
+void table_bitmap_scan_setup(BitmapHeapScanState *scanstate)
+{
+ bitmapheap_pgsr_alloc(scanstate);
+}
/* ----------------------------------------------------------------
* BitmapHeapNext
@@ -75,10 +77,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
ExprContext *econtext;
TableScanDesc scan;
TIDBitmap *tbm;
- TBMIterator *tbmiterator = NULL;
- TBMSharedIterator *shared_tbmiterator = NULL;
- TBMIterateResult *tbmres;
- TupleTableSlot *slot;
+ TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
ParallelBitmapHeapState *pstate = node->pstate;
dsa_area *dsa = node->ss.ps.state->es_query_dsa;
@@ -89,23 +88,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
slot = node->ss.ss_ScanTupleSlot;
scan = node->ss.ss_currentScanDesc;
tbm = node->tbm;
- if (pstate == NULL)
- tbmiterator = node->tbmiterator;
- else
- shared_tbmiterator = node->shared_tbmiterator;
- tbmres = node->tbmres;
/*
* If we haven't yet performed the underlying index scan, do it, and begin
* the iteration over the bitmap.
- *
- * For prefetching, we use *two* iterators, one for the pages we are
- * actually scanning and another that runs ahead of the first for
- * prefetching. node->prefetch_pages tracks exactly how many pages ahead
- * the prefetch iterator is. Also, node->prefetch_target tracks the
- * desired prefetch distance, which starts small and increases up to the
- * node->prefetch_maximum. This is to avoid doing a lot of prefetching in
- * a scan that stops after a few tuples because of a LIMIT.
*/
if (!node->initialized)
{
@@ -117,17 +103,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
elog(ERROR, "unrecognized result from subplan");
node->tbm = tbm;
- node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
- node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->prefetch_iterator = tbm_begin_iterate(tbm);
- node->prefetch_pages = 0;
- node->prefetch_target = -1;
- }
-#endif /* USE_PREFETCH */
+ node->tbmiterator = tbm_begin_iterate(tbm);
}
else
{
@@ -143,180 +119,50 @@ BitmapHeapNext(BitmapHeapScanState *node)
elog(ERROR, "unrecognized result from subplan");
node->tbm = tbm;
-
/*
* Prepare to iterate over the TBM. This will return the
* dsa_pointer of the iterator state which will be used by
* multiple processes to iterate jointly.
*/
- pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- pstate->prefetch_iterator =
- tbm_prepare_shared_iterate(tbm);
-
- /*
- * We don't need the mutex here as we haven't yet woke up
- * others.
- */
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = -1;
- }
-#endif
+ pstate->tbmiterator =
+ tbm_prepare_shared_iterate(tbm);
/* We have initialized the shared state so wake up others. */
BitmapDoneInitializingSharedState(pstate);
}
/* Allocate a private iterator and attach the shared state to it */
- node->shared_tbmiterator = shared_tbmiterator =
+ node->shared_tbmiterator =
tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
- node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
- if (node->prefetch_maximum > 0)
- {
- node->shared_prefetch_iterator =
- tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
- }
-#endif /* USE_PREFETCH */
}
node->initialized = true;
+ /* do any required setup, such as setting up streaming read helper */
+ // TODO: modify for parallel as relevant
+ table_bitmap_scan_setup(node);
+ /* get the first block */
+ while (!table_scan_bitmap_next_block(scan, &node->tbmres));
+ if (node->tbmres == NULL)
+ return NULL;
+
+ if (node->tbmres->ntuples >= 0)
+ node->exact_pages++;
+ else
+ node->lossy_pages++;
}
+
+ // TODO: seems like it would be more clear to have an independent function
+ // getting the next tuple and block and then only have the recheck here.
+ // the loop condition would be next_tuple != NULL
for (;;)
{
- bool skip_fetch;
-
CHECK_FOR_INTERRUPTS();
-
- /*
- * Get next page of results if needed
- */
- if (tbmres == NULL)
- {
- if (!pstate)
- node->tbmres = tbmres = tbm_iterate(tbmiterator);
- else
- node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
- if (tbmres == NULL)
- {
- /* no more entries in the bitmap */
- break;
- }
-
- BitmapAdjustPrefetchIterator(node, tbmres);
-
- /*
- * We can skip fetching the heap page if we don't need any fields
- * from the heap, and the bitmap entries don't need rechecking,
- * and all tuples on the page are visible to our transaction.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- skip_fetch = (node->can_skip_fetch &&
- !tbmres->recheck &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmres->blockno,
- &node->vmbuffer));
-
- if (skip_fetch)
- {
- /* can't be lossy in the skip_fetch case */
- Assert(tbmres->ntuples >= 0);
-
- /*
- * The number of tuples on this page is put into
- * node->return_empty_tuples.
- */
- node->return_empty_tuples = tbmres->ntuples;
- }
- else if (!table_scan_bitmap_next_block(scan, tbmres))
- {
- /* AM doesn't think this block is valid, skip */
- continue;
- }
-
- if (tbmres->ntuples >= 0)
- node->exact_pages++;
- else
- node->lossy_pages++;
-
- /* Adjust the prefetch target */
- BitmapAdjustPrefetchTarget(node);
- }
- else
- {
- /*
- * Continuing in previously obtained page.
- */
-
-#ifdef USE_PREFETCH
-
- /*
- * Try to prefetch at least a few pages even before we get to the
- * second page if we don't stop reading after the first tuple.
- */
- if (!pstate)
- {
- if (node->prefetch_target < node->prefetch_maximum)
- node->prefetch_target++;
- }
- else if (pstate->prefetch_target < node->prefetch_maximum)
- {
- /* take spinlock while updating shared state */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target < node->prefetch_maximum)
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
- }
-
- /*
- * We issue prefetch requests *after* fetching the current page to try
- * to avoid having prefetching interfere with the main I/O. Also, this
- * should happen only when we have determined there is still something
- * to do on the current page, else we may uselessly prefetch the same
- * page we are just about to request for real.
- *
- * XXX: It's a layering violation that we do these checks above
- * tableam, they should probably moved below it at some point.
- */
- BitmapPrefetch(node, scan);
-
- if (node->return_empty_tuples > 0)
+ /* Attempt to fetch tuple from AM. */
+ if (table_scan_bitmap_next_tuple(scan, node->tbmres, slot))
{
- /*
- * If we don't have to fetch the tuple, just return nulls.
- */
- ExecStoreAllNullTuple(slot);
-
- if (--node->return_empty_tuples == 0)
- {
- /* no more tuples to return in the next round */
- node->tbmres = tbmres = NULL;
- }
- }
- else
- {
- /*
- * Attempt to fetch tuple from AM.
- */
- if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
- {
- /* nothing more to look at on this page */
- node->tbmres = tbmres = NULL;
- continue;
- }
-
- /*
- * If we are using lossy info, we have to recheck the qual
- * conditions at every tuple.
- */
- if (tbmres->recheck)
+ // TODO: couldn't we have recheck set to true when it was only because
+ // the bitmap was lossy and not because the qual needs to be rechecked?
+ if (node->tbmres->recheck)
{
econtext->ecxt_scantuple = slot;
if (!ExecQualAndReset(node->bitmapqualorig, econtext))
@@ -327,16 +173,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
continue;
}
}
+ return slot;
}
- /* OK to return this tuple */
- return slot;
- }
+ /*
+ * Get next page of results
+ */
+ while (!table_scan_bitmap_next_block(scan, &node->tbmres));
- /*
- * if we get here it means we are at the end of the scan..
- */
- return ExecClearTuple(slot);
+ /* if we get here it means we are at the end of the scan */
+ if (node->tbmres == NULL)
+ return NULL;
+
+ if (node->tbmres->ntuples >= 0)
+ node->exact_pages++;
+ else
+ node->lossy_pages++;
+ }
}
/*
@@ -354,235 +207,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
ConditionVariableBroadcast(&pstate->cv);
}
-/*
- * BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
- TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (node->prefetch_pages > 0)
- {
- /* The main iterator has closed the distance by one page */
- node->prefetch_pages--;
- }
- else if (prefetch_iterator)
- {
- /* Do not let the prefetch iterator get behind the main one */
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
- if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
- elog(ERROR, "prefetch and main iterators are out of sync");
- }
- return;
- }
-
- if (node->prefetch_maximum > 0)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages > 0)
- {
- pstate->prefetch_pages--;
- SpinLockRelease(&pstate->mutex);
- }
- else
- {
- /* Release the mutex before iterating */
- SpinLockRelease(&pstate->mutex);
-
- /*
- * In case of shared mode, we can not ensure that the current
- * blockno of the main iterator and that of the prefetch iterator
- * are same. It's possible that whatever blockno we are
- * prefetching will be processed by another process. Therefore,
- * we don't validate the blockno here as we do in non-parallel
- * case.
- */
- if (prefetch_iterator)
- tbm_shared_iterate(prefetch_iterator);
- }
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max. Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
-
- if (pstate == NULL)
- {
- if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
- return;
- }
-
- /* Do an unlocked check first to save spinlock acquisitions. */
- if (pstate->prefetch_target < node->prefetch_maximum)
- {
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
- pstate->prefetch_target = node->prefetch_maximum;
- else if (pstate->prefetch_target > 0)
- pstate->prefetch_target *= 2;
- else
- pstate->prefetch_target++;
- SpinLockRelease(&pstate->mutex);
- }
-#endif /* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
- /*
- * FIXME: This really should just all be replaced by using one iterator
- * and a PgStreamingRead. tbm_iterate() actually does a fair bit of work,
- * we don't want to repeat that. Nor is it good to do the buffer mapping
- * lookups twice.
- */
-#ifdef USE_PREFETCH
- ParallelBitmapHeapState *pstate = node->pstate;
- bool issued_prefetch = false;
-
- if (pstate == NULL)
- {
- TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (node->prefetch_pages < node->prefetch_target)
- {
- TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
- bool skip_fetch;
-
- if (tbmpre == NULL)
- {
- /* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = NULL;
- break;
- }
- node->prefetch_pages++;
-
- /*
- * If we expect not to have to actually read this heap page,
- * skip this prefetch call, but continue to run the prefetch
- * logic normally. (Would it be better not to increment
- * prefetch_pages?)
- *
- * This depends on the assumption that the index AM will
- * report the same recheck flag for this future heap page as
- * it did for the current heap page; which is not a certainty
- * but is true in many cases.
- */
- skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- {
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
- issued_prefetch = true;
- }
- }
- }
-
- return;
- }
-
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
- if (prefetch_iterator)
- {
- while (1)
- {
- TBMIterateResult *tbmpre;
- bool do_prefetch = false;
- bool skip_fetch;
-
- /*
- * Recheck under the mutex. If some other process has already
- * done enough prefetching then we need not to do anything.
- */
- SpinLockAcquire(&pstate->mutex);
- if (pstate->prefetch_pages < pstate->prefetch_target)
- {
- pstate->prefetch_pages++;
- do_prefetch = true;
- }
- SpinLockRelease(&pstate->mutex);
-
- if (!do_prefetch)
- return;
-
- tbmpre = tbm_shared_iterate(prefetch_iterator);
- if (tbmpre == NULL)
- {
- /* No more pages to prefetch */
- tbm_end_shared_iterate(prefetch_iterator);
- node->shared_prefetch_iterator = NULL;
- break;
- }
-
- /* As above, skip prefetch if we expect not to need page */
- skip_fetch = (node->can_skip_fetch &&
- (node->tbmres ? !node->tbmres->recheck : false) &&
- VM_ALL_VISIBLE(node->ss.ss_currentRelation,
- tbmpre->blockno,
- &node->pvmbuffer));
-
- if (!skip_fetch)
- {
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
- issued_prefetch = true;
- }
- }
- }
- }
-
- /*
- * The PrefetchBuffer() calls staged IOs, but didn't necessarily submit
- * them, as it is more efficient to amortize the syscall cost across
- * multiple calls.
- */
- if (issued_prefetch)
- pgaio_submit_pending(true);
-#endif /* USE_PREFETCH */
-}
/*
* BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
@@ -631,27 +255,18 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
/* release bitmaps and buffers if any */
if (node->tbmiterator)
tbm_end_iterate(node->tbmiterator);
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
if (node->shared_tbmiterator)
tbm_end_shared_iterate(node->shared_tbmiterator);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
if (node->vmbuffer != InvalidBuffer)
ReleaseBuffer(node->vmbuffer);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
node->tbm = NULL;
node->tbmiterator = NULL;
node->tbmres = NULL;
- node->prefetch_iterator = NULL;
node->initialized = false;
node->shared_tbmiterator = NULL;
- node->shared_prefetch_iterator = NULL;
node->vmbuffer = InvalidBuffer;
- node->pvmbuffer = InvalidBuffer;
ExecScanReScan(&node->ss);
@@ -699,18 +314,12 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
*/
if (node->tbmiterator)
tbm_end_iterate(node->tbmiterator);
- if (node->prefetch_iterator)
- tbm_end_iterate(node->prefetch_iterator);
if (node->tbm)
tbm_free(node->tbm);
if (node->shared_tbmiterator)
tbm_end_shared_iterate(node->shared_tbmiterator);
- if (node->shared_prefetch_iterator)
- tbm_end_shared_iterate(node->shared_prefetch_iterator);
if (node->vmbuffer != InvalidBuffer)
ReleaseBuffer(node->vmbuffer);
- if (node->pvmbuffer != InvalidBuffer)
- ReleaseBuffer(node->pvmbuffer);
/*
* close heap scan
@@ -750,18 +359,12 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->tbm = NULL;
scanstate->tbmiterator = NULL;
scanstate->tbmres = NULL;
- scanstate->return_empty_tuples = 0;
scanstate->vmbuffer = InvalidBuffer;
- scanstate->pvmbuffer = InvalidBuffer;
scanstate->exact_pages = 0;
scanstate->lossy_pages = 0;
- scanstate->prefetch_iterator = NULL;
- scanstate->prefetch_pages = 0;
- scanstate->prefetch_target = 0;
scanstate->pscan_len = 0;
scanstate->initialized = false;
scanstate->shared_tbmiterator = NULL;
- scanstate->shared_prefetch_iterator = NULL;
scanstate->pstate = NULL;
/*
@@ -812,13 +415,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->bitmapqualorig =
ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
- /*
- * Maximum number of prefetches for the tablespace if configured,
- * otherwise the current value of the effective_io_concurrency GUC.
- */
- scanstate->prefetch_maximum =
- get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
scanstate->ss.ss_currentRelation = currentRelation;
scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
@@ -909,12 +505,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
pstate->tbmiterator = 0;
- pstate->prefetch_iterator = 0;
/* Initialize the mutex */
SpinLockInit(&pstate->mutex);
- pstate->prefetch_pages = 0;
- pstate->prefetch_target = 0;
pstate->state = BM_INITIAL;
ConditionVariableInit(&pstate->cv);
@@ -946,11 +539,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
if (DsaPointerIsValid(pstate->tbmiterator))
tbm_free_shared_area(dsa, pstate->tbmiterator);
- if (DsaPointerIsValid(pstate->prefetch_iterator))
- tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
pstate->tbmiterator = InvalidDsaPointer;
- pstate->prefetch_iterator = InvalidDsaPointer;
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index c5feacbff4..9ac0fa98d0 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -180,7 +180,6 @@ struct TBMIterator
int spageptr; /* next spages index */
int schunkptr; /* next schunks index */
int schunkbit; /* next bit to check in current schunk */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
};
/*
@@ -221,7 +220,6 @@ struct TBMSharedIterator
PTEntryArray *ptbase; /* pagetable element array */
PTIterationArray *ptpages; /* sorted exact page index list */
PTIterationArray *ptchunks; /* sorted lossy page index list */
- TBMIterateResult output; /* MUST BE LAST (because variable-size) */
};
/* Local function prototypes */
@@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
* Create the TBMIterator struct, with enough trailing space to serve the
* needs of the TBMIterateResult sub-struct.
*/
- iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
iterator->tbm = tbm;
/*
@@ -966,11 +963,10 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
* be examined, but the condition must be rechecked anyway. (For ease of
* testing, recheck is always set true when ntuples < 0.)
*/
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
{
TIDBitmap *tbm = iterator->tbm;
- TBMIterateResult *output = &(iterator->output);
Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
@@ -1008,11 +1004,11 @@ tbm_iterate(TBMIterator *iterator)
chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
iterator->schunkbit++;
- return output;
+ return;
}
}
@@ -1028,16 +1024,16 @@ tbm_iterate(TBMIterator *iterator)
page = tbm->spages[iterator->spageptr];
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
iterator->spageptr++;
- return output;
+ return;
}
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
}
/*
@@ -1047,10 +1043,9 @@ tbm_iterate(TBMIterator *iterator)
* across multiple processes. We need to acquire the iterator LWLock,
* before accessing the shared members.
*/
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
{
- TBMIterateResult *output = &iterator->output;
TBMSharedIteratorState *istate = iterator->state;
PagetableEntry *ptbase = NULL;
int *idxpages = NULL;
@@ -1101,13 +1096,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
{
/* Return a lossy page indicator from the chunk */
- output->blockno = chunk_blockno;
- output->ntuples = -1;
- output->recheck = true;
+ tbmres->blockno = chunk_blockno;
+ tbmres->ntuples = -1;
+ tbmres->recheck = true;
istate->schunkbit++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
}
@@ -1117,21 +1112,21 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
int ntuples;
/* scan bitmap to extract individual offset numbers */
- ntuples = tbm_extract_page_tuple(page, output);
- output->blockno = page->blockno;
- output->ntuples = ntuples;
- output->recheck = page->recheck;
+ ntuples = tbm_extract_page_tuple(page, tbmres);
+ tbmres->blockno = page->blockno;
+ tbmres->ntuples = ntuples;
+ tbmres->recheck = page->recheck;
istate->spageptr++;
LWLockRelease(&istate->lock);
- return output;
+ return;
}
LWLockRelease(&istate->lock);
/* Nothing more in the bitmap */
- return NULL;
+ tbmres->blockno = InvalidBlockNumber;
}
/*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 670a40b4be..1122c098c7 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -352,6 +352,11 @@ typedef struct GinScanEntryData
TIDBitmap *matchBitmap;
TBMIterator *matchIterator;
TBMIterateResult *matchResult;
+ // TODO: a temporary hack to deal with the fact that I am
+ // 1) not sure if InvalidBlockNumber can come up for other reasons than exhausting the bitmap
+ // and 2) not having taken the time yet to check all the places where matchResult == NULL
+ // is used to make sure I can replace it with something else
+ TBMIterateResult *savedMatchResult;
/* used for Posting list and one page in Posting tree */
ItemPointerData *list;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 331f5c6716..c4d653e923 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
#include "access/skey.h"
#include "access/table.h" /* for backward compatibility */
#include "access/tableam.h"
+#include "nodes/execnodes.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
#include "storage/bufpage.h"
@@ -225,5 +226,6 @@ extern bool ResolveCminCmaxDuringDecoding(struct HTAB *tuplecid_data,
CommandId *cmin, CommandId *cmax);
extern void HeapCheckForSerializableConflictOut(bool valid, Relation relation, HeapTuple tuple,
Buffer buffer, Snapshot snapshot);
+extern void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate);
#endif /* HEAPAM_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 414b6b4d57..fea54384ec 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -786,7 +786,7 @@ typedef struct TableAmRoutine
* scan_bitmap_next_tuple need to exist, or neither.
*/
bool (*scan_bitmap_next_block) (TableScanDesc scan,
- struct TBMIterateResult *tbmres);
+ struct TBMIterateResult **tbmres);
/*
* Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1929,7 +1929,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
*/
static inline bool
table_scan_bitmap_next_block(TableScanDesc scan,
- struct TBMIterateResult *tbmres)
+ struct TBMIterateResult **tbmres)
{
/*
* We don't expect direct calls to table_scan_bitmap_next_block with valid
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 3b0bd5acb8..64d8c6a07c 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
ParallelContext *pcxt);
extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
#endif /* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..fc69ea55d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1532,11 +1532,7 @@ typedef enum
/* ----------------
* ParallelBitmapHeapState information
* tbmiterator iterator for scanning current pages
- * prefetch_iterator iterator for prefetching ahead of current page
- * mutex mutual exclusion for the prefetching variable
- * and state
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
+ * mutex mutual exclusion for the state
* state current state of the TIDBitmap
* cv conditional wait variable
* phs_snapshot_data snapshot data shared to workers
@@ -1545,10 +1541,7 @@ typedef enum
typedef struct ParallelBitmapHeapState
{
dsa_pointer tbmiterator;
- dsa_pointer prefetch_iterator;
slock_t mutex;
- int prefetch_pages;
- int prefetch_target;
SharedBitmapState state;
ConditionVariable cv;
char phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1559,22 +1552,16 @@ typedef struct ParallelBitmapHeapState
*
* bitmapqualorig execution state for bitmapqualorig expressions
* tbm bitmap obtained from child index scan(s)
- * tbmiterator iterator for scanning current pages
+ * tbmiterator iterator for scanning pages
* tbmres current-page data
* can_skip_fetch can we potentially skip tuple fetches in this scan?
* return_empty_tuples number of empty tuples to return
* vmbuffer buffer for visibility-map lookups
- * pvmbuffer ditto, for prefetched pages
* exact_pages total number of exact pages retrieved
* lossy_pages total number of lossy pages retrieved
- * prefetch_iterator iterator for prefetching ahead of current page
- * prefetch_pages # pages prefetch iterator is ahead of current
- * prefetch_target current target prefetch distance
- * prefetch_maximum maximum value for prefetch_target
* pscan_len size of the shared memory for parallel bitmap
* initialized is node is ready to iterate
* shared_tbmiterator shared iterator
- * shared_prefetch_iterator shared iterator for prefetching
* pstate shared state for parallel bitmap scan
* ----------------
*/
@@ -1586,19 +1573,12 @@ typedef struct BitmapHeapScanState
TBMIterator *tbmiterator;
TBMIterateResult *tbmres;
bool can_skip_fetch;
- int return_empty_tuples;
Buffer vmbuffer;
- Buffer pvmbuffer;
long exact_pages;
long lossy_pages;
- TBMIterator *prefetch_iterator;
- int prefetch_pages;
- int prefetch_target;
- int prefetch_maximum;
Size pscan_len;
bool initialized;
TBMSharedIterator *shared_tbmiterator;
- TBMSharedIterator *shared_prefetch_iterator;
ParallelBitmapHeapState *pstate;
} BitmapHeapScanState;
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index bc67166105..236de80f23 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -23,6 +23,8 @@
#define TIDBITMAP_H
#include "storage/itemptr.h"
+// TODO: not great that I am including this now
+#include "storage/buf.h"
#include "utils/dsa.h"
@@ -40,6 +42,7 @@ typedef struct TBMSharedIterator TBMSharedIterator;
typedef struct TBMIterateResult
{
BlockNumber blockno; /* page number containing tuples */
+ Buffer buffer;
int ntuples; /* -1 indicates lossy result */
bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
@@ -64,8 +67,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
extern void tbm_end_iterate(TBMIterator *iterator);
extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 9a07f06b9f..8e1aa48827 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -39,7 +39,7 @@ typedef enum IoMethod
} IoMethod;
/* We'll default to bgworker. */
-#define DEFAULT_IO_METHOD IOMETHOD_WORKER
+#define DEFAULT_IO_METHOD IOMETHOD_IO_URING
/* GUCs */
extern int io_method;
--
2.27.0