On Wed, Mar 25, 2026 at 2:54 PM Melanie Plageman <[email protected]> wrote:
>
> I'm trying to think of cases where the two approaches would differ so
> I can decide which to do.
>
> > 0003
> >
> > - Half the "beginscan" calls use a ternary operator directly, half sets
> > a variable first (and then uses that). Often mixed in the same file.
> > Shouldn't it be a bit consistent?
>
> Indeed.
Attached v46 addresses your feedback and has a bit of assorted cleanup in it.

I started wondering if table_beginscan_strat() is a bit weird now because
it has two boolean arguments that are basically just SO_ALLOW_STRAT and
SO_ALLOW_SYNC -- so those are kind of letting the user set "internal"
flags. Anyway, I'm not sure we should do anything about it, but it got me
thinking.

- Melanie
From 4216d588438aacd4023801869edc464dc2cb0921 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 3 Dec 2025 15:07:24 -0500
Subject: [PATCH v46 1/5] Track which relations are modified by a query

Save the range table indexes of relations modified by a query in a
bitmap in the PlannedStmt. This is derived from existing PlannedStmt
members listing row marks and result relations, but precomputing it
allows cheap membership checks during execution.

A later commit will use this information during scans to control whether
or not on-access pruning is allowed to set the visibility map -- which
would be counterproductive if the query will modify the page.

Relations are considered modified if they are the target of INSERT,
UPDATE, DELETE, or MERGE, or if they have any row mark (including SELECT
FOR UPDATE/SHARE and non-locking marks like ROW_MARK_REFERENCE). Since
this bitmap is used to avoid unnecessary work, it is okay for it to be
conservative.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
---
 src/backend/executor/execParallel.c    |  1 +
 src/backend/executor/nodeLockRows.c    |  3 +++
 src/backend/executor/nodeModifyTable.c | 20 ++++++++++++++++++++
 src/backend/optimizer/plan/planner.c   | 21 ++++++++++++++++++++-
 src/include/nodes/plannodes.h          |  6 ++++++
 5 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..4f39767d033 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -188,6 +188,7 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->partPruneInfos = estate->es_part_prune_infos;
 	pstmt->rtable = estate->es_range_table;
 	pstmt->unprunableRelids = estate->es_unpruned_relids;
+	pstmt->modifiedRelids = estate->es_plannedstmt->modifiedRelids;
 	pstmt->permInfos = estate->es_rteperminfos;
 	pstmt->resultRelations = NIL;
 	pstmt->appendRelations = NIL;
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 8d865470780..38a43315f11 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -113,6 +113,9 @@ lnext:
 		}
 		erm->ermActive = true;

+		Assert(bms_is_member(erm->rti,
+							 estate->es_plannedstmt->modifiedRelids));
+
 		/* fetch the tuple's ctid */
 		datum = ExecGetJunkAttribute(slot,
									 aerm->ctidAttNo,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4cd5e262e0f..b22264c343b 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -896,6 +896,14 @@ ExecInsert(ModifyTableContext *context,
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;

+	/*
+	 * If this is a leaf partition we just found, it won't have a valid range
+	 * table index.
+	 */
+	Assert(resultRelInfo->ri_RangeTableIndex == 0 ||
+		   bms_is_member(resultRelInfo->ri_RangeTableIndex,
+						 estate->es_plannedstmt->modifiedRelids));
+
 	/*
 	 * Open the table's indexes, if we have not done so already, so that we
 	 * can add new index entries for the inserted tuple.
@@ -1523,6 +1531,9 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 {
 	EState	   *estate = context->estate;

+	Assert(bms_is_member(resultRelInfo->ri_RangeTableIndex,
+						 estate->es_plannedstmt->modifiedRelids));
+
 	return table_tuple_delete(resultRelInfo->ri_RelationDesc, tupleid,
							  estate->es_output_cid,
							  estate->es_snapshot,
@@ -2205,6 +2216,15 @@ ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	bool		partition_constraint_failed;
 	TM_Result	result;

+	/*
+	 * Tuple routing for cross-partition updates or ON CONFLICT ... DO UPDATE
+	 * may open leaf partitions not in the range table, in which case
+	 * ri_RangeTableIndex is 0.
+	 */
+	Assert(resultRelInfo->ri_RangeTableIndex == 0 ||
+		   bms_is_member(resultRelInfo->ri_RangeTableIndex,
+						 estate->es_plannedstmt->modifiedRelids));
+
 	updateCxt->crossPartUpdate = false;

 	/*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 42604a0f75c..de10b6fb413 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -340,8 +340,10 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	RelOptInfo *final_rel;
 	Path	   *best_path;
 	Plan	   *top_plan;
+	Bitmapset  *modifiedRelids = NULL;
 	ListCell   *lp,
-			   *lr;
+			   *lr,
+			   *lc;

 	/*
 	 * Set up global state for this planner invocation. This data is needed
@@ -661,6 +663,23 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->subplans = glob->subplans;
 	result->rewindPlanIDs = glob->rewindPlanIDs;
 	result->rowMarks = glob->finalrowmarks;
+
+	/*
+	 * Compute modifiedRelids from result relations and row marks.
+	 *
+	 * This isn't exactly what the executor will actually modify/lock at
+	 * runtime. Runtime partition pruning may eliminate some result relations
+	 * and some rowmarks are included that may not result in table
+	 * modification. Conversely, leaf partitions whose result relations are
+	 * created at the time of insert are not included here.
+	 */
+	foreach(lc, glob->resultRelations)
+		modifiedRelids = bms_add_member(modifiedRelids, lfirst_int(lc));
+	foreach(lc, glob->finalrowmarks)
+		modifiedRelids = bms_add_member(modifiedRelids,
+										((PlanRowMark *) lfirst(lc))->rti);
+	result->modifiedRelids = modifiedRelids;
+
 	result->relationOids = glob->relationOids;
 	result->invalItems = glob->invalItems;
 	result->paramExecTypes = glob->paramExecTypes;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b6185825fcb..a9cf9dd0f29 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -112,6 +112,12 @@ typedef struct PlannedStmt
 	 */
 	Bitmapset  *unprunableRelids;

+	/*
+	 * RT indexes of relations modified by the query through
+	 * UPDATE/DELETE/INSERT/MERGE or targeted by SELECT FOR UPDATE/SHARE.
+	 */
+	Bitmapset  *modifiedRelids;
+
 	/*
 	 * list of RTEPermissionInfo nodes for rtable entries needing one
 	 */
--
2.43.0
From 384753f272cbe60e77299b97153801d3a448d33b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Thu, 19 Mar 2026 17:05:55 -0400
Subject: [PATCH v46 2/5] Thread flags through begin-scan APIs

Add a user-settable flags parameter to the table_beginscan_* wrappers,
index_beginscan(), table_index_fetch_begin(), and the table AM callback
index_fetch_begin(). This allows users to pass additional context to be
used when building the scan descriptors.

For index scans, a new flags field is added to IndexFetchTableData, and
the heap AM stores the caller-provided flags there in
heapam_index_fetch_begin().

This introduces an extension point for follow-up work to pass per-scan
information (such as whether the relation is read-only for the current
query) from the executor to the AM layer.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
---
 contrib/pgrowlocks/pgrowlocks.c           |  3 +-
 src/backend/access/brin/brin.c            |  3 +-
 src/backend/access/gin/gininsert.c        |  3 +-
 src/backend/access/heap/heapam_handler.c  |  9 ++-
 src/backend/access/index/genam.c          |  6 +-
 src/backend/access/index/indexam.c        | 10 ++--
 src/backend/access/nbtree/nbtsort.c       |  3 +-
 src/backend/access/table/tableam.c        | 22 +++----
 src/backend/commands/constraint.c         |  3 +-
 src/backend/commands/copyto.c             |  3 +-
 src/backend/commands/tablecmds.c          | 13 ++--
 src/backend/commands/typecmds.c           |  6 +-
 src/backend/executor/execIndexing.c       |  4 +-
 src/backend/executor/execReplication.c    | 14 +++--
 src/backend/executor/nodeBitmapHeapscan.c |  3 +-
 src/backend/executor/nodeIndexonlyscan.c  |  9 ++-
 src/backend/executor/nodeIndexscan.c      | 12 ++--
 src/backend/executor/nodeSamplescan.c     |  3 +-
 src/backend/executor/nodeSeqscan.c        |  9 ++-
 src/backend/executor/nodeTidrangescan.c   |  7 ++-
 src/backend/partitioning/partbounds.c     |  3 +-
 src/backend/utils/adt/selfuncs.c          |  3 +-
 src/include/access/genam.h                |  6 +-
 src/include/access/heapam.h               |  5 +-
 src/include/access/relscan.h              |  1 +
 src/include/access/tableam.h              | 73 +++++++++++++++--------
 26 files changed, 152 insertions(+), 84 deletions(-)

diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index ff3692c87c4..75ad379190f 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -115,7 +115,8 @@ pgrowlocks(PG_FUNCTION_ARGS)
 					   RelationGetRelationName(rel));

 	/* Scan the relation */
-	scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+	scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL,
+						   0 /* flags */ );
 	hscan = (HeapScanDesc) scan;

 	attinmeta = TupleDescGetAttInMetadata(rsinfo->setDesc);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2a0f8c8e3b8..536493fa38a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2844,7 +2844,8 @@ _brin_parallel_scan_and_build(BrinBuildState *state,
 	indexInfo->ii_Concurrent = brinshared->isconcurrent;

 	scan = table_beginscan_parallel(heap,
-									ParallelTableScanFromBrinShared(brinshared));
+									ParallelTableScanFromBrinShared(brinshared),
+									0 /* flags */ );

 	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
									   brinbuildCallbackParallel, state, scan);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e54782d9dd8..d4e9c9ed950 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2068,7 +2068,8 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;

 	scan = table_beginscan_parallel(heap,
-									ParallelTableScanFromGinBuildShared(ginshared));
+									ParallelTableScanFromGinBuildShared(ginshared),
+									0 /* flags */ );

 	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
									   ginBuildCallbackParallel, state, scan);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d40878928e1..8c7695ebfb9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -81,11 +81,12 @@ heapam_slot_callbacks(Relation relation)
  */

 static IndexFetchTableData *
-heapam_index_fetch_begin(Relation rel)
+heapam_index_fetch_begin(Relation rel, uint32 flags)
 {
 	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);

 	hscan->xs_base.rel = rel;
+	hscan->xs_base.flags = flags;
 	hscan->xs_cbuf = InvalidBuffer;
 	hscan->xs_vmbuffer = InvalidBuffer;

@@ -763,7 +764,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0,
+									0 /* flags */ );
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -772,7 +774,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
									 PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);

-		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL,
+									0 /* flags */ );
 		heapScan = (HeapScanDesc) tableScan;
 		indexScan = NULL;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a62c..03a243345bc 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -455,7 +455,8 @@ systable_beginscan(Relation heapRelation,
 		}

 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, NULL, nkeys, 0);
+										 snapshot, NULL, nkeys, 0,
+										 0 /* flags */ );
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 		sysscan->scan = NULL;
@@ -716,7 +717,8 @@ systable_beginscan_ordered(Relation heapRelation,
 	bsysscan = true;

 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, NULL, nkeys, 0);
+									 snapshot, NULL, nkeys, 0,
+									 0 /* flags */ );
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 	sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index fbfc33159eb..13cdbb86cd7 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -258,7 +258,8 @@
 index_beginscan(Relation heapRelation,
				Relation indexRelation,
				Snapshot snapshot,
				IndexScanInstrumentation *instrument,
-				int nkeys, int norderbys)
+				int nkeys, int norderbys,
+				uint32 flags)
 {
 	IndexScanDesc scan;

@@ -285,7 +286,7 @@ index_beginscan(Relation heapRelation,
 	scan->instrument = instrument;

 	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+	scan->xs_heapfetch = table_index_fetch_begin(heapRelation, flags);

 	return scan;
 }
@@ -594,7 +595,8 @@
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel,
						  IndexScanInstrumentation *instrument,
						  int nkeys, int norderbys,
-						 ParallelIndexScanDesc pscan)
+						 ParallelIndexScanDesc pscan,
+						 uint32 flags)
 {
 	Snapshot	snapshot;
 	IndexScanDesc scan;
@@ -616,7 +618,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	scan->instrument = instrument;

 	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
+	scan->xs_heapfetch = table_index_fetch_begin(heaprel, flags);

 	return scan;
 }
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47a9bda30c9..3c444ece216 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1928,7 +1928,8 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
 	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+									ParallelTableScanFromBTShared(btshared),
+									0 /* flags */ );
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
									   true, progress, _bt_build_callback,
									   &buildstate, scan);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index dfda1af412e..3ac4027ce11 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -118,7 +118,7 @@ table_beginscan_catalog(Relation relation, int nkeys, ScanKeyData *key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));

 	return table_beginscan_common(relation, snapshot, nkeys, key,
-								  NULL, flags);
+								  NULL, flags, 0);
 }

@@ -163,10 +163,11 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
 }

 TableScanDesc
-table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
+table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan,
+						 uint32 flags)
 {
 	Snapshot	snapshot;
-	uint32		flags = SO_TYPE_SEQSCAN |
+	uint32		internal_flags = SO_TYPE_SEQSCAN |
		SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
@@ -176,7 +177,7 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
 		RegisterSnapshot(snapshot);
-		flags |= SO_TEMP_SNAPSHOT;
+		internal_flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
 	{
@@ -185,16 +186,17 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 	}

 	return table_beginscan_common(relation, snapshot, 0, NULL,
-								  pscan, flags);
+								  pscan, internal_flags, flags);
 }

 TableScanDesc
 table_beginscan_parallel_tidrange(Relation relation,
-								  ParallelTableScanDesc pscan)
+								  ParallelTableScanDesc pscan,
+								  uint32 flags)
 {
 	Snapshot	snapshot;
-	uint32		flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
 	TableScanDesc sscan;
+	uint32		internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;

 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));

@@ -206,7 +208,7 @@ table_beginscan_parallel_tidrange(Relation relation,
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
 		RegisterSnapshot(snapshot);
-		flags |= SO_TEMP_SNAPSHOT;
+		internal_flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
 	{
@@ -215,7 +217,7 @@ table_beginscan_parallel_tidrange(Relation relation,
 	}

 	sscan = table_beginscan_common(relation, snapshot, 0, NULL,
-								   pscan, flags);
+								   pscan, internal_flags, flags);

 	return sscan;
 }
@@ -248,7 +250,7 @@ table_index_fetch_tuple_check(Relation rel,
 	bool		found;

 	slot = table_slot_create(rel, NULL);
-	scan = table_index_fetch_begin(rel);
+	scan = table_index_fetch_begin(rel, 0 /* flags */ );
 	found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
									all_dead);
 	table_index_fetch_end(scan);
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index cc11c47b6f2..1c4f5a25ba4 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -106,7 +106,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 	 */
 	tmptid = checktid;
 	{
-		IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation);
+		IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation,
+															0 /* flags */ );
 		bool		call_again = false;

 		if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot,
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index faf62d959b4..e6c237d6d0f 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -1336,7 +1336,8 @@ CopyRelationTo(CopyToState cstate, Relation rel, Relation root_rel, uint64 *proc
 	AttrMap    *map = NULL;
 	TupleTableSlot *root_slot = NULL;

-	scandesc = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+	scandesc = table_beginscan(rel, GetActiveSnapshot(), 0, NULL,
+							   0 /* flags */ );
 	slot = table_slot_create(rel, NULL);

 	/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c69c12dc014..6dd3aed6b98 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6411,7 +6411,8 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap)
 	 * checking all the constraints.
 	 */
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
-	scan = table_beginscan(oldrel, snapshot, 0, NULL);
+	scan = table_beginscan(oldrel, snapshot, 0, NULL,
+						   0 /* flags */ );

 	/*
 	 * Switch to per-tuple memory context and reset it for each tuple
@@ -13980,8 +13981,8 @@ validateForeignKeyConstraint(char *conname,
 	 */
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	slot = table_slot_create(rel, NULL);
-	scan = table_beginscan(rel, snapshot, 0, NULL);
-
+	scan = table_beginscan(rel, snapshot, 0, NULL,
+						   0 /* flags */ );
 	perTupCxt = AllocSetContextCreate(CurrentMemoryContext,
									  "validateForeignKeyConstraint",
									  ALLOCSET_SMALL_SIZES);
@@ -22882,7 +22883,8 @@ MergePartitionsMoveRows(List **wqueue, List *mergingPartitions, Relation newPart
 	/* Scan through the rows. */
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
-	scan = table_beginscan(mergingPartition, snapshot, 0, NULL);
+	scan = table_beginscan(mergingPartition, snapshot, 0, NULL,
+						   0 /* flags */ );

 	/*
 	 * Switch to per-tuple memory context and reset it for each tuple
@@ -23346,7 +23348,8 @@ SplitPartitionMoveRows(List **wqueue, Relation rel, Relation splitRel,
 	/* Scan through the rows. */
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
-	scan = table_beginscan(splitRel, snapshot, 0, NULL);
+	scan = table_beginscan(splitRel, snapshot, 0, NULL,
+						   0 /* flags */ );

 	/*
 	 * Switch to per-tuple memory context and reset it for each tuple
diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 3dab6bb5a79..115bd77af27 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -3185,7 +3185,8 @@ validateDomainNotNullConstraint(Oid domainoid)
 		/* Scan all tuples in this relation */
 		snapshot = RegisterSnapshot(GetLatestSnapshot());
-		scan = table_beginscan(testrel, snapshot, 0, NULL);
+		scan = table_beginscan(testrel, snapshot, 0, NULL,
+							   0 /* flags */ );
 		slot = table_slot_create(testrel, NULL);
 		while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 		{
@@ -3266,7 +3267,8 @@ validateDomainCheckConstraint(Oid domainoid, const char *ccbin, LOCKMODE lockmod
 		/* Scan all tuples in this relation */
 		snapshot = RegisterSnapshot(GetLatestSnapshot());
-		scan = table_beginscan(testrel, snapshot, 0, NULL);
+		scan = table_beginscan(testrel, snapshot, 0, NULL,
+							   0 /* flags */ );
 		slot = table_slot_create(testrel, NULL);
 		while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 		{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495c6..72671013c52 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -815,7 +815,9 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index,
+								 &DirtySnapshot, NULL, indnkeyatts, 0,
+								 0 /* flags */ );
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);

 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 2497ee7edc5..ca0d1cc6b95 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,7 +205,9 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);

 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel,
+						   &snap, NULL, skey_attoff, 0,
+						   0 /* flags */ );

retry:
 	found = false;
@@ -383,7 +385,8 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
 	/* Start a heap scan. */
 	InitDirtySnapshot(snap);
-	scan = table_beginscan(rel, &snap, 0, NULL);
+	scan = table_beginscan(rel, &snap, 0, NULL,
+						   0 /* flags */ );
 	scanslot = table_slot_create(rel, NULL);

retry:
@@ -602,7 +605,8 @@ RelationFindDeletedTupleInfoSeq(Relation rel, TupleTableSlot *searchslot,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = table_beginscan(rel, SnapshotAny, 0, NULL);
+	scan = table_beginscan(rel, SnapshotAny, 0, NULL,
+						   0 /* flags */ );
 	scanslot = table_slot_create(rel, NULL);

 	table_rescan(scan, NULL);
@@ -666,7 +670,9 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel,
+						   SnapshotAny, NULL, skey_attoff, 0,
+						   0 /* flags */ );

 	index_rescan(scan, skey, skey_attoff, NULL, 0);
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 7cf8d23c742..e58bb02db43 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -148,7 +148,8 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
			table_beginscan_bm(node->ss.ss_currentRelation,
							   node->ss.ps.state->es_snapshot,
							   0,
-							   NULL);
+							   NULL,
+							   0 /* flags */ );
 	}

 	node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9eab81fd1c8..f8a6671793f 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -95,7 +95,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
										   estate->es_snapshot,
										   node->ioss_Instrument,
										   node->ioss_NumScanKeys,
-										   node->ioss_NumOrderByKeys);
+										   node->ioss_NumOrderByKeys,
+										   0 /* flags */ );

 		node->ioss_ScanDesc = scandesc;
@@ -794,7 +795,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
								 node->ioss_Instrument,
								 node->ioss_NumScanKeys,
								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 0 /* flags */ );
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
@@ -860,7 +862,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
								 node->ioss_Instrument,
								 node->ioss_NumScanKeys,
								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 0 /* flags */ );
 	node->ioss_ScanDesc->xs_want_itup = true;

 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 06143e94c5a..3df091ac000 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -113,7 +113,8 @@ IndexNext(IndexScanState *node)
									   estate->es_snapshot,
									   node->iss_Instrument,
									   node->iss_NumScanKeys,
-									   node->iss_NumOrderByKeys);
+									   node->iss_NumOrderByKeys,
+									   0 /* flags */ );

 		node->iss_ScanDesc = scandesc;
@@ -209,7 +210,8 @@ IndexNextWithReorder(IndexScanState *node)
									   estate->es_snapshot,
									   node->iss_Instrument,
									   node->iss_NumScanKeys,
-									   node->iss_NumOrderByKeys);
+									   node->iss_NumOrderByKeys,
+									   0 /* flags */ );

 		node->iss_ScanDesc = scandesc;
@@ -1730,7 +1732,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
							 node->iss_Instrument,
							 node->iss_NumScanKeys,
							 node->iss_NumOrderByKeys,
-							 piscan);
+							 piscan,
+							 0 /* flags */ );

 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1794,7 +1797,8 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
							 node->iss_Instrument,
							 node->iss_NumScanKeys,
							 node->iss_NumOrderByKeys,
-							 piscan);
+							 piscan,
+							 0 /* flags */ );

 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 6b0d65f752f..f0e14e53fab 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -298,7 +298,8 @@ tablesample_init(SampleScanState *scanstate)
										  0, NULL,
										  scanstate->use_bulkread,
										  allow_sync,
-										  scanstate->use_pagemode);
+										  scanstate->use_pagemode,
+										  0 /* flags */ );
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 8f219f60a93..eaa8cfb6a1a 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -71,7 +71,8 @@ SeqNext(SeqScanState *node)
		 */
		scandesc = table_beginscan(node->ss.ss_currentRelation,
								   estate->es_snapshot,
-								   0, NULL);
+								   0, NULL,
+								   0 /* flags */ );
		node->ss.ss_currentScanDesc = scandesc;
 	}

@@ -375,7 +376,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
									  estate->es_snapshot);
	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
+								 0 /* flags */ );
 }

 /* ----------------------------------------------------------------
@@ -408,5 +410,6 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
+								 0 /* flags */ );
 }
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 617713bde04..6f63e9f80d0 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -245,7 +245,8 @@ TidRangeNext(TidRangeScanState *node)
			scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
												estate->es_snapshot,
												&node->trss_mintid,
-												&node->trss_maxtid);
+												&node->trss_maxtid,
+												0 /* flags */ );
			node->ss.ss_currentScanDesc = scandesc;
 		}
 		else
@@ -460,7 +461,7 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
	node->ss.ss_currentScanDesc =
		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan);
+										  pscan, 0 /* flags */ );
 }

 /* ----------------------------------------------------------------
@@ -494,5 +495,5 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
	node->ss.ss_currentScanDesc =
		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan);
+										  pscan, 0 /* flags */ );
 }
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0ca312ac27d..c0f847b43be 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -3362,7 +3362,8 @@ check_default_partition_contents(Relation parent, Relation default_rel,
	econtext = GetPerTupleExprContext(estate);
	snapshot = RegisterSnapshot(GetLatestSnapshot());
	tupslot = table_slot_create(part_rel, &estate->es_tupleTable);
-	scan = table_beginscan(part_rel, snapshot, 0, NULL);
+	scan = table_beginscan(part_rel, snapshot, 0, NULL,
+						   0 /* flags */ );

	/*
	 * Switch to per-tuple memory context and reset it for each tuple
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 53f85ccde01..9fbbb6a8ddc 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7178,7 +7178,8 @@ get_actual_variable_endpoint(Relation heapRel,
	index_scan = index_beginscan(heapRel, indexRel,
								 &SnapshotNonVacuumable, NULL,
-								 1, 0);
+								 1, 0,
+								 0 /* flags */ );
	/* Set it up for index-only scan */
	index_scan->xs_want_itup = true;
	index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1a27bf060b3..b69320a7fc8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -158,7 +158,8 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
									 Relation indexRelation,
									 Snapshot snapshot,
									 IndexScanInstrumentation *instrument,
-									 int nkeys, int norderbys);
+									 int nkeys, int norderbys,
+									 uint32 flags);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
											Snapshot snapshot,
											IndexScanInstrumentation *instrument,
@@ -184,7 +185,8 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
											  Relation indexrel,
											  IndexScanInstrumentation *instrument,
											  int nkeys, int norderbys,
-											  ParallelIndexScanDesc pscan);
+											  ParallelIndexScanDesc pscan,
+											  uint32 flags);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
									 ScanDirection direction);
 extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9b403203006..e2e07348f37 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -95,10 +95,7 @@ typedef struct HeapScanDescData
	 */
	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;

-	/*
-	 * For sequential scans and bitmap heap scans. The current heap block's
-	 * corresponding page in the visibility map.
-	 */
+	/* Current heap block's corresponding page in the visibility map */
	Buffer		rs_vmbuffer;

	/* these fields only used in page-at-a-time mode and for bitmap scans */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076f8..80ea0b437d1 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -122,6 +122,7 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
	Relation	rel;
+	uint32		flags;
 } IndexFetchTableData;

 struct IndexScanInstrumentation;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..ce5176bdf69 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -65,6 +65,16 @@ typedef enum ScanOptions
	SO_TEMP_SNAPSHOT = 1 << 9,
 } ScanOptions;

+/*
+ * Mask of flags that are set internally by the table_beginscan_* functions
+ * and must not be passed by callers.
+ */
+#define SO_INTERNAL_FLAGS \
+	(SO_TYPE_SEQSCAN | SO_TYPE_BITMAPSCAN | SO_TYPE_SAMPLESCAN | \
+	 SO_TYPE_TIDSCAN | SO_TYPE_TIDRANGESCAN | SO_TYPE_ANALYZE | \
+	 SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE | \
+	 SO_TEMP_SNAPSHOT)
+
 /*
  * Result codes for table_{update,delete,lock_tuple}, and for visibility
  * routines inside table AMs.
@@ -420,7 +430,7 @@ typedef struct TableAmRoutine
	 *
	 * Tuples for an index scan can then be fetched via index_fetch_tuple.
	 */
-	struct IndexFetchTableData *(*index_fetch_begin) (Relation rel);
+	struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);

	/*
	 * Reset index fetch. Typically this will release cross index fetch
@@ -871,12 +881,19 @@ extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
 * A wrapper around the Table Access Method scan_begin callback, to centralize
 * error checking. All calls to ->scan_begin() should go through this
 * function.
+ *
+ * The caller-provided user_flags are validated against SO_INTERNAL_FLAGS to
+ * catch callers that accidentally pass scan-type or other internal flags.
 */
static TableScanDesc
table_beginscan_common(Relation rel, Snapshot snapshot, int nkeys,
					   ScanKeyData *key, ParallelTableScanDesc pscan,
-					   uint32 flags)
+					   uint32 flags, uint32 user_flags)
{
+	Assert((user_flags & SO_INTERNAL_FLAGS) == 0);
+	Assert((flags & ~SO_INTERNAL_FLAGS) == 0);
+	flags |= user_flags;
+
	/*
	 * We don't allow scans to be started while CheckXidAlive is set, except
	 * via systable_beginscan() et al. See detailed comments in xact.c where
@@ -894,12 +911,13 @@
 */
static inline TableScanDesc
table_beginscan(Relation rel, Snapshot snapshot,
-				int nkeys, ScanKeyData *key)
+				int nkeys, ScanKeyData *key, uint32 flags)
{
-	uint32		flags = SO_TYPE_SEQSCAN |
+	uint32		internal_flags = SO_TYPE_SEQSCAN |
		SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+								  internal_flags, flags);
}

/*
@@ -928,7 +946,7 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
	if (allow_sync)
		flags |= SO_ALLOW_SYNC;

-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags, 0);
}

/*
@@ -939,11 +957,12 @@
 */
static inline TableScanDesc
table_beginscan_bm(Relation rel, Snapshot snapshot,
-				   int nkeys, ScanKeyData *key)
+				   int nkeys, ScanKeyData *key, uint32 flags)
{
-	uint32		flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
+	uint32		internal_flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;

-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+								  internal_flags, flags);
}

/*
@@ -957,18 +976,19 @@
static inline TableScanDesc
table_beginscan_sampling(Relation rel, Snapshot snapshot,
						 int nkeys, ScanKeyData *key,
						 bool allow_strat, bool allow_sync,
-						 bool allow_pagemode)
+						 bool allow_pagemode, uint32 flags)
{
-	uint32		flags = SO_TYPE_SAMPLESCAN;
+	uint32		internal_flags = SO_TYPE_SAMPLESCAN;

	if (allow_strat)
-		flags |= SO_ALLOW_STRAT;
+		internal_flags |= SO_ALLOW_STRAT;
	if (allow_sync)
-		flags |= SO_ALLOW_SYNC;
+		internal_flags |= SO_ALLOW_SYNC;
	if (allow_pagemode)
-		flags |= SO_ALLOW_PAGEMODE;
+		internal_flags |= SO_ALLOW_PAGEMODE;

-	return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags);
+	return table_beginscan_common(rel, snapshot, nkeys, key, NULL,
+								  internal_flags, flags);
}

/*
@@ -981,7 +1001,7 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
{
	uint32		flags = SO_TYPE_TIDSCAN;

-	return table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
+	return table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags, 0);
}

/*
@@ -994,7 +1014,7 @@ table_beginscan_analyze(Relation rel)
{
	uint32		flags = SO_TYPE_ANALYZE;

-	return table_beginscan_common(rel, NULL, 0, NULL, NULL, flags);
+	return table_beginscan_common(rel, NULL, 0, NULL, NULL, flags, 0);
}

/*
@@ -1059,12 +1079,13 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
static inline TableScanDesc
table_beginscan_tidrange(Relation rel, Snapshot snapshot,
						 ItemPointer mintid,
-						 ItemPointer maxtid)
+						 ItemPointer maxtid, uint32 flags)
{
	TableScanDesc sscan;
-	uint32		flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+	uint32		internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;

-	sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
+	sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL,
+								   internal_flags, flags);

	/* Set the range of TIDs to scan */
	sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
@@ -1139,7 +1160,8 @@ extern void
table_parallelscan_initialize(Relation rel, * Caller must hold a suitable lock on the relation. */ extern TableScanDesc table_beginscan_parallel(Relation relation, - ParallelTableScanDesc pscan); + ParallelTableScanDesc pscan, + uint32 flags); /* * Begin a parallel tid range scan. `pscan` needs to have been initialized @@ -1149,7 +1171,8 @@ extern TableScanDesc table_beginscan_parallel(Relation relation, * Caller must hold a suitable lock on the relation. */ extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation, - ParallelTableScanDesc pscan); + ParallelTableScanDesc pscan, + uint32 flags); /* * Restart a parallel scan. Call this in the leader process. Caller is @@ -1175,8 +1198,10 @@ table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan) * Tuples for an index scan can then be fetched via table_index_fetch_tuple(). */ static inline IndexFetchTableData * -table_index_fetch_begin(Relation rel) +table_index_fetch_begin(Relation rel, uint32 flags) { + Assert((flags & SO_INTERNAL_FLAGS) == 0); + /* * We don't allow scans to be started while CheckXidAlive is set, except * via systable_beginscan() et al. See detailed comments in xact.c where @@ -1185,7 +1210,7 @@ table_index_fetch_begin(Relation rel) if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan)) elog(ERROR, "scan started during logical decoding"); - return rel->rd_tableam->index_fetch_begin(rel); + return rel->rd_tableam->index_fetch_begin(rel, flags); } /* -- 2.43.0
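As a side note on the internal/user flag split in table_beginscan_common() above: the contract is that the beginscan wrappers supply only SO_INTERNAL_FLAGS bits while callers supply only non-internal hint bits, enforced by the two Asserts. A minimal standalone sketch of that pattern follows; the names and bit positions (XSO_*) are illustrative stand-ins, not the actual PostgreSQL definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative scan flags, loosely modeled on ScanOptions. */
#define XSO_TYPE_SEQSCAN   (1u << 0)
#define XSO_ALLOW_STRAT    (1u << 1)
#define XSO_ALLOW_SYNC     (1u << 2)
#define XSO_HINT_READ_ONLY (1u << 10)

/* Flags reserved for the beginscan wrappers; callers must never pass these. */
#define XSO_INTERNAL_FLAGS (XSO_TYPE_SEQSCAN | XSO_ALLOW_STRAT | XSO_ALLOW_SYNC)

static uint32_t
combine_scan_flags(uint32_t internal_flags, uint32_t user_flags)
{
    /* callers must not smuggle in internal (scan-type etc.) flags */
    assert((user_flags & XSO_INTERNAL_FLAGS) == 0);
    /* wrappers must pass only internal flags through this argument */
    assert((internal_flags & ~XSO_INTERNAL_FLAGS) == 0);
    return internal_flags | user_flags;
}
```

The two masks partition the flag space, so each Assert catches one direction of misuse at the single choke point rather than in every wrapper.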
From 105c2c2c0057ee9945cf6ec1c32061f617f627a2 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Mon, 2 Mar 2026 16:31:33 -0500 Subject: [PATCH v46 3/5] Pass down information on table modification to scan node Pass down information to sequential scan, index [only] scan, bitmap table scan, sample scan, and TID range scan nodes on whether or not the query modifies the relation being scanned. A later commit will use this information to update the VM during on-access pruning only if the relation is not modified by the query. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Andrey Borodin <[email protected]> Reviewed-by: Tomas Vondra <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru --- src/backend/executor/execUtils.c | 8 ++++++++ src/backend/executor/nodeBitmapHeapscan.c | 3 ++- src/backend/executor/nodeIndexonlyscan.c | 9 ++++++--- src/backend/executor/nodeIndexscan.c | 12 ++++++++---- src/backend/executor/nodeSamplescan.c | 3 ++- src/backend/executor/nodeSeqscan.c | 10 +++++++--- src/backend/executor/nodeTidrangescan.c | 11 ++++++++--- src/include/access/tableam.h | 3 +++ src/include/executor/executor.h | 2 ++ 9 files changed, 46 insertions(+), 15 deletions(-) diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index 9886ab06b69..d2ffe28e010 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -736,6 +736,14 @@ ExecRelationIsTargetRelation(EState *estate, Index scanrelid) return list_member_int(estate->es_plannedstmt->resultRelations, scanrelid); } +/* Return true if the scan node's relation is not modified by the query */ +bool +ScanRelIsReadOnly(ScanState *ss) +{ + return !bms_is_member(((Scan *) ss->ps.plan)->scanrelid, + ss->ps.state->es_plannedstmt->modifiedRelids); +} + /* 
---------------------------------------------------------------- * ExecOpenScanRelation * diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index e58bb02db43..7096e6f8645 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -149,7 +149,8 @@ BitmapTableScanSetup(BitmapHeapScanState *node) node->ss.ps.state->es_snapshot, 0, NULL, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); } node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator; diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c index f8a6671793f..3971e54d7da 100644 --- a/src/backend/executor/nodeIndexonlyscan.c +++ b/src/backend/executor/nodeIndexonlyscan.c @@ -96,7 +96,8 @@ IndexOnlyNext(IndexOnlyScanState *node) node->ioss_Instrument, node->ioss_NumScanKeys, node->ioss_NumOrderByKeys, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); node->ioss_ScanDesc = scandesc; @@ -796,7 +797,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node, node->ioss_NumScanKeys, node->ioss_NumOrderByKeys, piscan, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); node->ioss_ScanDesc->xs_want_itup = true; node->ioss_VMBuffer = InvalidBuffer; @@ -863,7 +865,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node, node->ioss_NumScanKeys, node->ioss_NumOrderByKeys, piscan, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? 
+ SO_HINT_REL_READ_ONLY : 0); node->ioss_ScanDesc->xs_want_itup = true; /* diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c index 3df091ac000..09df10dd78a 100644 --- a/src/backend/executor/nodeIndexscan.c +++ b/src/backend/executor/nodeIndexscan.c @@ -114,7 +114,8 @@ IndexNext(IndexScanState *node) node->iss_Instrument, node->iss_NumScanKeys, node->iss_NumOrderByKeys, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); node->iss_ScanDesc = scandesc; @@ -211,7 +212,8 @@ IndexNextWithReorder(IndexScanState *node) node->iss_Instrument, node->iss_NumScanKeys, node->iss_NumOrderByKeys, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); node->iss_ScanDesc = scandesc; @@ -1733,7 +1735,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node, node->iss_NumScanKeys, node->iss_NumOrderByKeys, piscan, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); /* * If no run-time keys to calculate or they are ready, go ahead and pass @@ -1798,7 +1801,8 @@ ExecIndexScanInitializeWorker(IndexScanState *node, node->iss_NumScanKeys, node->iss_NumOrderByKeys, piscan, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); /* * If no run-time keys to calculate or they are ready, go ahead and pass diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c index f0e14e53fab..98fab36fbdc 100644 --- a/src/backend/executor/nodeSamplescan.c +++ b/src/backend/executor/nodeSamplescan.c @@ -299,7 +299,8 @@ tablesample_init(SampleScanState *scanstate) scanstate->use_bulkread, allow_sync, scanstate->use_pagemode, - 0 /* flags */ ); + ScanRelIsReadOnly(&scanstate->ss) ? 
+ SO_HINT_REL_READ_ONLY : 0); } else { diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c index eaa8cfb6a1a..2f4c18051cd 100644 --- a/src/backend/executor/nodeSeqscan.c +++ b/src/backend/executor/nodeSeqscan.c @@ -72,7 +72,8 @@ SeqNext(SeqScanState *node) scandesc = table_beginscan(node->ss.ss_currentRelation, estate->es_snapshot, 0, NULL, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); node->ss.ss_currentScanDesc = scandesc; } @@ -375,9 +376,11 @@ ExecSeqScanInitializeDSM(SeqScanState *node, pscan, estate->es_snapshot); shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan); + node->ss.ss_currentScanDesc = table_beginscan_parallel(node->ss.ss_currentRelation, pscan, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); } /* ---------------------------------------------------------------- @@ -411,5 +414,6 @@ ExecSeqScanInitializeWorker(SeqScanState *node, pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false); node->ss.ss_currentScanDesc = table_beginscan_parallel(node->ss.ss_currentRelation, pscan, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); } diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c index 6f63e9f80d0..f83a72e3635 100644 --- a/src/backend/executor/nodeTidrangescan.c +++ b/src/backend/executor/nodeTidrangescan.c @@ -246,7 +246,8 @@ TidRangeNext(TidRangeScanState *node) estate->es_snapshot, &node->trss_mintid, &node->trss_maxtid, - 0 /* flags */ ); + ScanRelIsReadOnly(&node->ss) ? 
+ SO_HINT_REL_READ_ONLY : 0); node->ss.ss_currentScanDesc = scandesc; } else @@ -461,7 +462,9 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt) shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan); node->ss.ss_currentScanDesc = table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, - pscan, 0 /* flags */ ); + pscan, + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); } /* ---------------------------------------------------------------- @@ -495,5 +498,7 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node, pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false); node->ss.ss_currentScanDesc = table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, - pscan, 0 /* flags */ ); + pscan, + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0); } diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index ce5176bdf69..014c686a5de 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -63,6 +63,9 @@ typedef enum ScanOptions /* unregister snapshot at scan end? */ SO_TEMP_SNAPSHOT = 1 << 9, + + /* set if the query doesn't modify the relation */ + SO_HINT_REL_READ_ONLY = 1 << 10, } ScanOptions; /* diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 07f4b1f7490..7979a17e4ec 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -690,6 +690,8 @@ extern void ExecCreateScanSlotFromOuterPlan(EState *estate, extern bool ExecRelationIsTargetRelation(EState *estate, Index scanrelid); +extern bool ScanRelIsReadOnly(ScanState *ss); + extern Relation ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags); extern void ExecInitRangeTable(EState *estate, List *rangeTable, List *permInfos, -- 2.43.0
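The ScanRelIsReadOnly() helper added above reduces to a membership test of the scan node's scanrelid against the precomputed modifiedRelids bitmap, which the executor nodes then turn into the SO_HINT_REL_READ_ONLY flag via a ternary. A toy model of that pattern, using a plain 64-bit word in place of a Bitmapset (names invented for the sketch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a Bitmapset of range-table indexes (relids 0..63). */
typedef uint64_t relid_set;

static bool
relid_is_member(relid_set modified, unsigned relid)
{
    return (modified >> relid) & 1u;
}

/*
 * Analogous to ScanRelIsReadOnly(): the scanned relation is read-only
 * iff it is not in the set of relations the query modifies.
 */
static bool
scan_rel_is_read_only(relid_set modified, unsigned scanrelid)
{
    return !relid_is_member(modified, scanrelid);
}
```

Because the bitmap is conservative (it may include relations that end up unmodified), a false answer here only forgoes an optimization; it can never enable on-access VM setting for a relation the query actually writes.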
From e8f9da0d1ca12ab03cb58e4283dfd4111aa9fc2c Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Fri, 27 Feb 2026 16:33:40 -0500 Subject: [PATCH v46 4/5] Allow on-access pruning to set pages all-visible Many queries do not modify the underlying relation. For such queries, if on-access pruning occurs during the scan, we can check whether the page has become all-visible and update the visibility map accordingly. Previously, only vacuum and COPY FREEZE marked pages as all-visible or all-frozen. This commit implements on-access VM setting for sequential scans as well as for the underlying heap relation in index scans and bitmap heap scans. Setting the visibility map on-access can avoid write amplification caused by vacuum later needing to set the page all-visible, triggering a write and potentially a full-page image (FPI). It also allows more frequent index-only scans, since they require pages to be marked all-visible in the VM. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Kirill Reshke <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com --- src/backend/access/heap/heapam.c | 3 +- src/backend/access/heap/heapam_handler.c | 6 ++-- src/backend/access/heap/pruneheap.c | 46 +++++++++++++++++------- src/backend/access/heap/vacuumlazy.c | 2 +- src/include/access/heapam.h | 3 +- 5 files changed, 43 insertions(+), 17 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 044f385e477..dbdf6521c42 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -633,7 +633,8 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * Prune and repair fragmentation for the whole page, if possible. 
*/ - heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer); + heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer, + (sscan->rs_flags & SO_HINT_REL_READ_ONLY)); /* * We must hold share lock on the buffer content while examining tuple diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 8c7695ebfb9..d59b423c8ad 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -149,7 +149,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, */ if (prev_buf != hscan->xs_cbuf) heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf, - &hscan->xs_vmbuffer); + &hscan->xs_vmbuffer, + (hscan->xs_base.flags & SO_HINT_REL_READ_ONLY)); } /* Obtain share-lock on the buffer so we can examine visibility */ @@ -2546,7 +2547,8 @@ BitmapHeapScanNextBlock(TableScanDesc scan, /* * Prune and repair fragmentation for the whole page, if possible. */ - heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer); + heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer, + scan->rs_flags & SO_HINT_REL_READ_ONLY); /* * We must hold share lock on the buffer content while examining tuple diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index 6693af8da7f..d83fd26b274 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -44,6 +44,8 @@ typedef struct bool mark_unused_now; /* whether to attempt freezing tuples */ bool attempt_freeze; + /* whether to attempt setting the VM */ + bool attempt_set_vm; struct VacuumCutoffs *cutoffs; Relation relation; @@ -232,7 +234,8 @@ static void page_verify_redirects(Page page); static bool heap_page_will_freeze(bool did_tuple_hint_fpi, bool do_prune, bool do_hint_prune, PruneState *prstate); -static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason); +static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason, + bool 
do_prune, bool do_freeze); /* @@ -253,7 +256,8 @@ static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason); * unpinning *vmbuffer. */ void -heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer) +heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer, + bool rel_read_only) { Page page = BufferGetPage(buffer); TransactionId prune_xid; @@ -336,6 +340,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer) * current implementation. */ params.options = HEAP_PAGE_PRUNE_ALLOW_FAST_PATH; + if (rel_read_only) + params.options |= HEAP_PAGE_PRUNE_SET_VM; heap_page_prune_and_freeze(¶ms, &presult, &dummy_off_loc, NULL, NULL); @@ -392,6 +398,7 @@ prune_freeze_setup(PruneFreezeParams *params, /* cutoffs must be provided if we will attempt freezing */ Assert(!(params->options & HEAP_PAGE_PRUNE_FREEZE) || params->cutoffs); prstate->attempt_freeze = (params->options & HEAP_PAGE_PRUNE_FREEZE) != 0; + prstate->attempt_set_vm = (params->options & HEAP_PAGE_PRUNE_SET_VM) != 0; prstate->cutoffs = params->cutoffs; prstate->relation = params->relation; prstate->block = BufferGetBlockNumber(params->buffer); @@ -461,9 +468,8 @@ prune_freeze_setup(PruneFreezeParams *params, * We track whether the page will be all-visible/all-frozen at the end of * pruning and freezing. While examining tuple visibility, we'll set * set_all_visible to false if there are tuples on the page not visible to - * all running and future transactions. set_all_visible is always - * maintained but only VACUUM will set the VM if the page ends up being - * all-visible. + * all running and future transactions. If enabled for this scan, we will + * set the VM if the page ends up being all-visible. * * We also keep track of the newest live XID, which is used to calculate * the snapshot conflict horizon for a WAL record setting the VM. 
@@ -920,21 +926,37 @@ heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum, * This function does not actually set the VM bits or page-level visibility * hint, PD_ALL_VISIBLE. * + * This should be called only after do_freeze has been decided (and do_prune + * has been set), as these factor into our heuristic-based decision. + * * Returns true if one or both VM bits should be set and false otherwise. */ static bool -heap_page_will_set_vm(PruneState *prstate, PruneReason reason) +heap_page_will_set_vm(PruneState *prstate, PruneReason reason, + bool do_prune, bool do_freeze) { - /* - * Though on-access pruning maintains prstate->set_all_visible, we don't - * set the VM on-access for now. - */ - if (reason == PRUNE_ON_ACCESS) + if (!prstate->attempt_set_vm) return false; if (!prstate->set_all_visible) return false; + /* + * If this is an on-access call and we're not actually pruning, avoid + * setting the visibility map if it would newly dirty the heap page or, if + * the page is already dirty, if doing so would require including a + * full-page image (FPI) of the heap page in the WAL. This situation + * should be rare, as on-access pruning is only attempted when + * pd_prune_xid is valid. 
+ */ + if (reason == PRUNE_ON_ACCESS && !do_prune && !do_freeze && + (!BufferIsDirty(prstate->buffer) || XLogCheckBufferNeedsBackup(prstate->buffer))) + { + prstate->set_all_visible = false; + prstate->set_all_frozen = false; + return false; + } + prstate->new_vmbits = VISIBILITYMAP_ALL_VISIBLE; if (prstate->set_all_frozen) @@ -1167,7 +1189,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params, Assert(!prstate.set_all_frozen || prstate.set_all_visible); Assert(!prstate.set_all_visible || (prstate.lpdead_items == 0)); - do_set_vm = heap_page_will_set_vm(&prstate, params->reason); + do_set_vm = heap_page_will_set_vm(&prstate, params->reason, do_prune, do_freeze); /* * new_vmbits should be 0 regardless of whether or not the page is diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index f698c2d899b..24001b27387 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -2021,7 +2021,7 @@ lazy_scan_prune(LVRelState *vacrel, .buffer = buf, .vmbuffer = vmbuffer, .reason = PRUNE_VACUUM_SCAN, - .options = HEAP_PAGE_PRUNE_FREEZE, + .options = HEAP_PAGE_PRUNE_FREEZE | HEAP_PAGE_PRUNE_SET_VM, .vistest = vacrel->vistest, .cutoffs = &vacrel->cutoffs, }; diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index e2e07348f37..f2a009141be 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -43,6 +43,7 @@ #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW (1 << 0) #define HEAP_PAGE_PRUNE_FREEZE (1 << 1) #define HEAP_PAGE_PRUNE_ALLOW_FAST_PATH (1 << 2) +#define HEAP_PAGE_PRUNE_SET_VM (1 << 3) typedef struct BulkInsertStateData *BulkInsertState; typedef struct GlobalVisState GlobalVisState; @@ -431,7 +432,7 @@ extern TransactionId heap_index_delete_tuples(Relation rel, /* in heap/pruneheap.c */ extern void heap_page_prune_opt(Relation relation, Buffer buffer, - Buffer *vmbuffer); + Buffer *vmbuffer, bool rel_read_only); extern void 
heap_page_prune_and_freeze(PruneFreezeParams *params, PruneFreezeResult *presult, OffsetNumber *off_loc, -- 2.43.0
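The on-access branch of heap_page_will_set_vm() in the patch above can be restated as a small truth function: set the VM only when the page is all-visible and doing so is cheap — either pruning/freezing dirties the page anyway, or the page is already dirty and logging the VM set would not force a new FPI. This is a simplified restatement with invented argument names, not the actual function:

```c
#include <stdbool.h>

/*
 * Condensed sketch of the on-access decision in heap_page_will_set_vm().
 * attempt_set_vm: HEAP_PAGE_PRUNE_SET_VM was requested for this scan.
 * all_visible:    every tuple on the page is visible to all transactions.
 * do_prune/do_freeze: the page will be modified by pruning/freezing anyway.
 * page_dirty:     the buffer is already dirty.
 * needs_fpi:      WAL-logging the change would emit a full-page image.
 */
static bool
will_set_vm_on_access(bool attempt_set_vm, bool all_visible,
                      bool do_prune, bool do_freeze,
                      bool page_dirty, bool needs_fpi)
{
    if (!attempt_set_vm || !all_visible)
        return false;
    if (do_prune || do_freeze)
        return true;            /* page gets dirtied regardless */
    return page_dirty && !needs_fpi;    /* avoid new dirtying and FPIs */
}
```

As the patch's comment notes, the "not pruning and not dirty" case should be rare, since on-access pruning is only attempted when pd_prune_xid is valid.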
From 9d6d6c2529700e4fe381dbc55ef172ba13882fab Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Tue, 29 Jul 2025 16:12:56 -0400 Subject: [PATCH v46 5/5] Set pd_prune_xid on insert MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that visibility map (VM) updates can occur during read-only queries, it makes sense to also set the page’s pd_prune_xid hint during inserts and on the new page during updates. This enables heap_page_prune_and_freeze() to set the VM all-visible after a page is filled with newly inserted tuples the first time it is read. This means the page can be set all-visible while it is still in shared buffers, avoiding potential I/O amplification when vacuum later has to scan the page and set it all-visible. It also enables index-only scans of newly inserted data much sooner. This change also addresses a long-standing note in heap_insert() and heap_multi_insert(), which observed that setting pd_prune_xid would help clean up aborted insertions sooner. Without it, such tuples might linger until VACUUM, whereas now they can be pruned earlier. 
Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com --- src/backend/access/heap/heapam.c | 39 +++++++++++++++++---------- src/backend/access/heap/heapam_xlog.c | 19 ++++++++++++- src/backend/access/heap/pruneheap.c | 18 ++++++------- 3 files changed, 51 insertions(+), 25 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dbdf6521c42..cdaf57e3f12 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2156,6 +2156,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; Buffer buffer; + Page page; Buffer vmbuffer = InvalidBuffer; bool all_visible_cleared = false; @@ -2182,6 +2183,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, &vmbuffer, NULL, 0); + page = BufferGetPage(buffer); + /* * We're about to do the actual insert -- but check for conflict first, to * avoid possibly having to roll back work we've just done. @@ -2205,25 +2208,30 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, RelationPutHeapTuple(relation, buffer, heaptup, (options & HEAP_INSERT_SPECULATIVE) != 0); - if (PageIsAllVisible(BufferGetPage(buffer))) + if (PageIsAllVisible(page)) { all_visible_cleared = true; - PageClearAllVisible(BufferGetPage(buffer)); + PageClearAllVisible(page); visibilitymap_clear(relation, ItemPointerGetBlockNumber(&(heaptup->t_self)), vmbuffer, VISIBILITYMAP_VALID_BITS); } /* - * XXX Should we set PageSetPrunable on this page ? + * Set pd_prune_xid to trigger heap_page_prune_and_freeze() once the page + * is full so that we can set the page all-visible in the VM on the next + * page access. 
* - * The inserting transaction may eventually abort thus making this tuple - * DEAD and hence available for pruning. Though we don't want to optimize - * for aborts, if no other tuple in this page is UPDATEd/DELETEd, the - * aborted tuple will never be pruned until next vacuum is triggered. + * Setting pd_prune_xid is also handy if the inserting transaction + * eventually aborts making this tuple DEAD and hence available for + * pruning. If no other tuple in this page is UPDATEd/DELETEd, the aborted + * tuple would never otherwise be pruned until next vacuum is triggered. * - * If you do add PageSetPrunable here, add it in heap_xlog_insert too. + * Don't set it if we are in bootstrap mode or we are inserting a frozen + * tuple, as there is no further pruning/freezing needed in those cases. */ + if (TransactionIdIsNormal(xid) && !(options & HEAP_INSERT_FROZEN)) + PageSetPrunable(page, xid); MarkBufferDirty(buffer); @@ -2233,7 +2241,6 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, xl_heap_insert xlrec; xl_heap_header xlhdr; XLogRecPtr recptr; - Page page = BufferGetPage(buffer); uint8 info = XLOG_HEAP_INSERT; int bufflags = 0; @@ -2598,8 +2605,12 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, } /* - * XXX Should we set PageSetPrunable on this page ? See heap_insert() + * Set pd_prune_xid. See heap_insert() for more on why we do this when + * inserting tuples. This only makes sense if we aren't already + * setting the page frozen in the VM and we're not in bootstrap mode. */ + if (!all_frozen_set && TransactionIdIsNormal(xid)) + PageSetPrunable(page, xid); MarkBufferDirty(buffer); @@ -4141,12 +4152,12 @@ l2: * the subsequent page pruning will be a no-op and the hint will be * cleared. * - * XXX Should we set hint on newbuf as well? If the transaction aborts, - * there would be a prunable tuple in the newbuf; but for now we choose - * not to optimize for aborts. 
Note that heap_xlog_update must be kept in - * sync if this decision changes. + * We set the new page prunable as well. See heap_insert() for more on why + * we do this when inserting tuples. */ PageSetPrunable(page, xid); + if (newbuf != buffer) + PageSetPrunable(newpage, xid); if (use_hot_update) { diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c index 1302bb13e18..f3f419d3dc1 100644 --- a/src/backend/access/heap/heapam_xlog.c +++ b/src/backend/access/heap/heapam_xlog.c @@ -450,6 +450,14 @@ heap_xlog_insert(XLogReaderState *record) freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */ + /* + * Set the page prunable to trigger on-access pruning later, which may + * set the page all-visible in the VM. See comments in heap_insert(). + */ + if (TransactionIdIsNormal(XLogRecGetXid(record)) && + !HeapTupleHeaderXminFrozen(htup)) + PageSetPrunable(page, XLogRecGetXid(record)); + PageSetLSN(page, lsn); if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED) @@ -599,12 +607,19 @@ heap_xlog_multi_insert(XLogReaderState *record) if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED) PageClearAllVisible(page); - /* XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible */ + /* + * XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible. If + * we are not setting the page frozen, then set the page's prunable + * hint so that we trigger on-access pruning later which may set the + * page all-visible in the VM. 
+ */ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET) { PageSetAllVisible(page); PageClearPrunable(page); } + else + PageSetPrunable(page, XLogRecGetXid(record)); MarkBufferDirty(buffer); } @@ -921,6 +936,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update) freespace = PageGetHeapFreeSpace(npage); PageSetLSN(npage, lsn); + /* See heap_insert() for why we set pd_prune_xid on insert */ + PageSetPrunable(npage, XLogRecGetXid(record)); MarkBufferDirty(nbuffer); } diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index d83fd26b274..bb364f53a44 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -275,7 +275,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer, /* * First check whether there's any chance there's something to prune, * determining the appropriate horizon is a waste if there's no prune_xid - * (i.e. no updates/deletes left potentially dead tuples around). + * (i.e. no updates/deletes left potentially dead tuples around and no + * inserts inserted new tuples that may be visible to all). */ prune_xid = PageGetPruneXid(page); if (!TransactionIdIsValid(prune_xid)) @@ -1918,17 +1919,14 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum) prstate->set_all_visible = false; prstate->set_all_frozen = false; - /* The page should not be marked all-visible */ - if (PageIsAllVisible(page)) - heap_page_fix_vm_corruption(prstate, offnum, - VM_CORRUPT_TUPLE_VISIBILITY); - /* - * If we wanted to optimize for aborts, we might consider marking - * the page prunable when we see INSERT_IN_PROGRESS. But we - * don't. See related decisions about when to mark the page - * prunable in heapam.c. + * Though there is nothing "prunable" on the page, we maintain + * pd_prune_xid for inserts so that we have the opportunity to + * mark them all-visible during the next round of pruning. 
*/ + heap_prune_record_prunable(prstate, + HeapTupleHeaderGetXmin(htup), + offnum); break; case HEAPTUPLE_DELETE_IN_PROGRESS: -- 2.43.0
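The insert-path rule patch 5/5 introduces can be stated as a predicate: hint the page prunable for any normal inserting XID unless the tuples are inserted frozen (bootstrap-mode XIDs are not normal, so that case falls out of the same check). A standalone sketch with invented names:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* XIDs below this are special (invalid, bootstrap, frozen). */
#define FIRST_NORMAL_XID 3

static bool
xid_is_normal(TransactionId xid)
{
    return xid >= FIRST_NORMAL_XID;
}

/*
 * Mirrors the new heap_insert() behavior: set pd_prune_xid so the next
 * on-access prune can consider marking the page all-visible, except when
 * the xid is not normal (e.g. bootstrap mode) or the tuples are inserted
 * already frozen, where no further pruning/freezing is needed.
 */
static bool
should_set_prune_xid_on_insert(TransactionId xid, bool insert_frozen)
{
    return xid_is_normal(xid) && !insert_frozen;
}
```

The same predicate shape recurs in heap_xlog_insert()/heap_xlog_multi_insert() in the patch, keeping redo in sync with the do-path as the old XXX comment demanded.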
