On 14/03/2024 06:54, Dilip Kumar wrote:
On Wed, Mar 13, 2024 at 9:25 PM Robert Haas <robertmh...@gmail.com> wrote:

On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
Andres already commented on the snapshot stuff on an earlier patch
version, and that's much nicer with this version. However, I don't
understand why a parallel bitmap heap scan needs to do anything at all
with the snapshot, even before these patches. The parallel worker
infrastructure already passes the active snapshot from the leader to the
parallel worker. Why does bitmap heap scan code need to do that too?

Yeah, thinking about this now, it seems you are right that the parallel
infrastructure is already passing the active snapshot, so why do we
need it again?  Then I checked other low-level scan nodes like indexscan
and seqscan, and it seems we are doing the same thing there as well.
Check for SerializeSnapshot() in table_parallelscan_initialize() and
index_parallelscan_initialize() which are being called from
ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM()
respectively.

I remember thinking about this when I was writing very early parallel
query code. It seemed to me that there must be some reason why the
EState has a snapshot, as opposed to just using the active snapshot,
and so I took care to propagate that snapshot, which is used for the
leader's scans, to the worker scans also. Now, if the EState doesn't
need to contain a snapshot, then all of that mechanism is unnecessary,
but I don't see how it can be right for the leader to do
table_beginscan() using estate->es_snapshot and the worker to use the
active snapshot.

Yeah, that's a very valid point. So I think now Heikki/Melanie might
have got an answer to their question about the thought process behind
serializing the snapshot for each scan node.  And the same approach is
followed for BitmapHeapNode as well.

I see. Thanks, understanding the thought process helps.

So when a parallel table or index scan runs in the executor as part of a query, we could just use the active snapshot. But there are some other callers of parallel table scans that don't use the executor, namely parallel index builds. For those, it makes sense to pass the scan's snapshot explicitly, independently of the active snapshot.
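
As a sketch of that non-executor pattern (simplified, with illustrative variable names like "heapRel" and "pcxt"; not copied from any particular index build):

    /* The caller picks and registers a snapshot of its own choosing. */
    Snapshot    snapshot = RegisterSnapshot(GetTransactionSnapshot());

    /*
     * table_parallelscan_initialize() serializes that snapshot into the
     * shared scan state, independently of the active snapshot.
     */
    Size        pscan_len = table_parallelscan_estimate(heapRel, snapshot);
    ParallelTableScanDesc pscan = shm_toc_allocate(pcxt->toc, pscan_len);

    table_parallelscan_initialize(heapRel, pscan, snapshot);

    /*
     * Each worker then begins its scan from the shared state, which restores
     * the serialized snapshot rather than using the worker's active one.
     */
    TableScanDesc scan = table_beginscan_parallel(heapRel, pscan);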

A parallel bitmap heap scan isn't really a parallel scan as far as the table AM is concerned, though. It's more like an independent bitmap heap scan in each worker process; nodeBitmapHeapscan.c does all the coordination of which blocks to scan. So I think that table_parallelscan_initialize() was the wrong role model, and we should still remove the snapshot serialization code from nodeBitmapHeapscan.c.
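
Roughly, every participant (leader and workers alike) does something like this; the only shared piece is the TIDBitmap iterator, which the table AM never sees. (Condensed from what ExecInitBitmapHeapScan() and BitmapHeapNext() do, so treat it as a sketch rather than a quote.)

    /*
     * Each process opens its own, ordinary bitmap heap scan using the
     * executor's snapshot -- nothing parallel from the table AM's view.
     */
    scandesc = table_beginscan_bm(node->ss.ss_currentRelation,
                                  node->ss.ps.state->es_snapshot,
                                  0, NULL);

    /*
     * Which blocks each process actually fetches is decided by the shared
     * iterator in ParallelBitmapHeapState, coordinated by
     * nodeBitmapHeapscan.c itself.
     */
    shared_tbmiterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
    tbmres = tbm_shared_iterate(shared_tbmiterator);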


Digging deeper into the question of whether es_snapshot == GetActiveSnapshot() is a valid assumption:

<deep dive>

es_snapshot is copied from the QueryDesc in standard_ExecutorStart(). Looking at the callers of ExecutorStart(), they all get the QueryDesc by calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any callers changing the active snapshot between the ExecutorStart() and ExecutorRun() calls either. In pquery.c, we explicitly PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So no live bug here AFAICS; es_snapshot == GetActiveSnapshot() holds.
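
For reference, the caller pattern is roughly this (condensed from the pquery.c path; not a literal excerpt):

    PushActiveSnapshot(GetTransactionSnapshot());

    queryDesc = CreateQueryDesc(plannedstmt, sourceText,
                                GetActiveSnapshot(),    /* becomes es_snapshot */
                                InvalidSnapshot,
                                dest, params, queryEnv, 0);

    ExecutorStart(queryDesc, 0);

    PushActiveSnapshot(queryDesc->snapshot);    /* pquery.c does this */
    ExecutorRun(queryDesc, ForwardScanDirection, 0, true);
    PopActiveSnapshot();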

_SPI_execute_plan() has code to deal with the possibility that the active snapshot is not set. That seems fishy; do we really support SPI without any snapshot? I'm inclined to turn that into an error. I ran the regression tests with an "Assert(ActiveSnapshotSet())" there, and everything worked.
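
That is, instead of silently tolerating a missing snapshot there, something like the following (the Assert() variant is what I actually tested; the error message is just illustrative):

    if (!ActiveSnapshotSet())
        elog(ERROR, "SPI query attempted without an active snapshot");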

If es_snapshot were different from the active snapshot, things would get weird, even without parallel query. The scans would use es_snapshot for the visibility checks, but any functions you execute in quals would use the active snapshot.

We could double down on that assumption, remove es_snapshot altogether, and use GetActiveSnapshot() instead. And perhaps add "PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun().
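
If we went that way, ExecutorRun() itself could pin the query's snapshot instead of trusting every caller to do it. A hypothetical sketch (not part of the attached patch):

    void
    ExecutorRun(QueryDesc *queryDesc,
                ScanDirection direction, uint64 count, bool execute_once)
    {
        /* Make the query's own snapshot the active one for the whole run. */
        PushActiveSnapshot(queryDesc->snapshot);

        if (ExecutorRun_hook)
            (*ExecutorRun_hook) (queryDesc, direction, count, execute_once);
        else
            standard_ExecutorRun(queryDesc, direction, count, execute_once);

        PopActiveSnapshot();
    }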

</deep dive>

In summary, this es_snapshot stuff is a bit confusing and could use some cleanup. But for now, I'd like to just add some assertions and comments about this, and remove the snapshot serialization from the bitmap heap scan node, to make it consistent with other non-parallel scan nodes (it's not really a parallel scan as far as the table AM is concerned). See the attached patch, which is the same as the previous patch with some extra assertions.

--
Heikki Linnakangas
Neon (https://neon.tech)
From 44eb77162ccc6f18a0a5eaccf0c083d4fefd076f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Thu, 14 Mar 2024 11:56:11 +0200
Subject: [PATCH v2 1/1] Remove redundant snapshot copying from parallel leader
 to workers

The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But the BitmapHeapScan node also had
bespoke code to pass the snapshot from the leader to the worker. That was
redundant, so remove it.

The removed code was analogous to the snapshot serialization in
table_parallelscan_initialize(), but that was the wrong role model. A
parallel bitmap heap scan is more like an independent non-parallel
bitmap heap scan in each parallel worker as far as the table AM is
concerned, because the coordination is done in nodeBitmapHeapscan.c,
and the table AM doesn't need to know anything about it.

This relies on the assumption that es_snapshot ==
GetActiveSnapshot(). That's not a new assumption, things would get
weird if you used the QueryDesc's snapshot for visibility checks in
the scans, but the active snapshot for evaluating quals, for
example. This could use some refactoring and cleanup, but for now,
just add some assertions.

Discussion: https://www.postgresql.org/message-id/5f3b9d59-0f43-419d-80ca-6d04c07cf...@iki.fi
---
 src/backend/access/table/tableam.c        | 10 ----------
 src/backend/executor/execMain.c           |  6 ++++++
 src/backend/executor/execParallel.c       |  7 +++++++
 src/backend/executor/nodeBitmapHeapscan.c | 17 ++---------------
 src/include/access/tableam.h              |  5 -----
 src/include/nodes/execnodes.h             |  4 ----
 6 files changed, 15 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a..e57a0b7ea3 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -120,16 +120,6 @@ table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
 											NULL, flags);
 }
 
-void
-table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
-{
-	Assert(IsMVCCSnapshot(snapshot));
-
-	RegisterSnapshot(snapshot);
-	scan->rs_snapshot = snapshot;
-	scan->rs_flags |= SO_TEMP_SNAPSHOT;
-}
-
 
 /* ----------------------------------------------------------------------------
  * Parallel table scan related functions.
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 940499cc61..7eb1f7d020 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -147,6 +147,9 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	Assert(queryDesc != NULL);
 	Assert(queryDesc->estate == NULL);
 
+	/* caller must ensure the query's snapshot is active */
+	Assert(GetActiveSnapshot() == queryDesc->snapshot);
+
 	/*
 	 * If the transaction is read-only, we need to check if any writes are
 	 * planned to non-temporary tables.  EXPLAIN is considered read-only.
@@ -319,6 +322,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	Assert(estate != NULL);
 	Assert(!(estate->es_top_eflags & EXEC_FLAG_EXPLAIN_ONLY));
 
+	/* caller must ensure the query's snapshot is active */
+	Assert(GetActiveSnapshot() == estate->es_snapshot);
+
 	/*
 	 * Switch into per-query memory context
 	 */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 3f84c002dc..8c53d1834e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -720,6 +720,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/*
+	 * InitializeParallelDSM() passes the active snapshot to the parallel
+	 * worker, which uses it to set es_snapshot.  Make sure we don't set
+	 * es_snapshot differently in the child.
+	 */
+	Assert(GetActiveSnapshot() == estate->es_snapshot);
+
 	/* Everyone's had a chance to ask for space, so now create the DSM. */
 	InitializeParallelDSM(pcxt);
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 345b67649e..ca548e44eb 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -721,7 +721,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->prefetch_iterator = NULL;
 	scanstate->prefetch_pages = 0;
 	scanstate->prefetch_target = 0;
-	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
 	scanstate->shared_tbmiterator = NULL;
 	scanstate->shared_prefetch_iterator = NULL;
@@ -841,13 +840,7 @@ void
 ExecBitmapHeapEstimate(BitmapHeapScanState *node,
 					   ParallelContext *pcxt)
 {
-	EState	   *estate = node->ss.ps.state;
-
-	node->pscan_len = add_size(offsetof(ParallelBitmapHeapState,
-										phs_snapshot_data),
-							   EstimateSnapshotSpace(estate->es_snapshot));
-
-	shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(ParallelBitmapHeapState));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 }
 
@@ -862,14 +855,13 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 							ParallelContext *pcxt)
 {
 	ParallelBitmapHeapState *pstate;
-	EState	   *estate = node->ss.ps.state;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
 	/* If there's no DSA, there are no workers; initialize nothing. */
 	if (dsa == NULL)
 		return;
 
-	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
+	pstate = shm_toc_allocate(pcxt->toc, sizeof(ParallelBitmapHeapState));
 
 	pstate->tbmiterator = 0;
 	pstate->prefetch_iterator = 0;
@@ -881,7 +873,6 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
-	SerializeSnapshot(estate->es_snapshot, pstate->phs_snapshot_data);
 
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pstate);
 	node->pstate = pstate;
@@ -927,13 +918,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 							   ParallelWorkerContext *pwcxt)
 {
 	ParallelBitmapHeapState *pstate;
-	Snapshot	snapshot;
 
 	Assert(node->ss.ps.state->es_query_dsa != NULL);
 
 	pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->pstate = pstate;
-
-	snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
-	table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
 }
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d..8249b37bbf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1038,11 +1038,6 @@ table_rescan_set_params(TableScanDesc scan, struct ScanKeyData *key,
 										 allow_pagemode);
 }
 
-/*
- * Update snapshot used by the scan.
- */
-extern void table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot);
-
 /*
  * Return next tuple from `scan`, store in slot.
  */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..27614ab50f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1689,7 +1689,6 @@ typedef enum
  *		prefetch_target			current target prefetch distance
  *		state					current state of the TIDBitmap
  *		cv						conditional wait variable
- *		phs_snapshot_data		snapshot data shared to workers
  * ----------------
  */
 typedef struct ParallelBitmapHeapState
@@ -1701,7 +1700,6 @@ typedef struct ParallelBitmapHeapState
 	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
-	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
 } ParallelBitmapHeapState;
 
 /* ----------------
@@ -1721,7 +1719,6 @@ typedef struct ParallelBitmapHeapState
  *		prefetch_pages	   # pages prefetch iterator is ahead of current
  *		prefetch_target    current target prefetch distance
  *		prefetch_maximum   maximum value for prefetch_target
- *		pscan_len		   size of the shared memory for parallel bitmap
  *		initialized		   is node is ready to iterate
  *		shared_tbmiterator	   shared iterator
  *		shared_prefetch_iterator shared iterator for prefetching
@@ -1745,7 +1742,6 @@ typedef struct BitmapHeapScanState
 	int			prefetch_pages;
 	int			prefetch_target;
 	int			prefetch_maximum;
-	Size		pscan_len;
 	bool		initialized;
 	TBMSharedIterator *shared_tbmiterator;
 	TBMSharedIterator *shared_prefetch_iterator;
-- 
2.39.2
