OK so after mulling this over for a long time, here's a patch for this.
It solves the problem by making the partition descriptor no longer be
cached if a detached partition is omitted.  Any transaction that needs a
partition descriptor that excludes detached partitions, will have to
recreate the partdesc from scratch.  To support this, I changed the
output boolean semantics: instead of "does this partdesc include an
detached partitions" as in your patch, it now is "are there any detached
partitions".  But whenever no detached partitions exist, or when all
partitions including detached are requested, then the cached descriptor
is returned (because that necessarily has to be correct).  The main
difference this has to your patch is that we always keep the descriptor
in the cache and don't rebuild anything, unless a detached partition is
present.

I flipped the find_inheritance_children() input boolean, from
"include_detached" to "omit_detached".  This is more natural, given the
internal behavior.  You could argue to propagate that naming change to
the partdesc.h API and PartitionDirectory, but I don't think there's a
need for that.

I ran all the detach-partition-concurrently tests under
debug_invalidate_system_caches_always=1 and everything passes.

I experimented with keeping a separate cached partition descriptor that
omits the detached partition, but that brings back some trouble; I
couldn't find a way to invalidate such a cached entry in a reasonable
way.  I have the patch for that, if somebody wants to play with it.

-- 
Álvaro Herrera       Valdivia, Chile
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."                        (Paul Thomas)
>From f95149f0fee451b8a2b49861dc813aeddf384f8f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Tue, 20 Apr 2021 18:01:18 -0400
Subject: [PATCH] Fix relcache hazard with detached partitions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During queries run by ri_triggers.c queries, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition.  Which is bogus: once the detach operation
completes, the row becomes an orphan.

However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query.  This commit changes the
caching code so that the partition descriptors we keep in relcache are
only those that do not contain any detached partition, *or* those that
are requested to contain all partitions.  When a partdesc-without-
detached-partitions is requested, we create one afresh each time.

This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus.  This commit also corrects that expected output.

Author: Amit Langote <amitlangot...@gmail.com>
Author: Álvaro Herrera <alvhe...@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215...@sss.pgh.pa.us
---
 src/backend/catalog/pg_inherits.c             | 64 +++++++++++--------
 src/backend/commands/tablecmds.c              | 22 +++----
 src/backend/commands/trigger.c                |  4 +-
 src/backend/executor/execPartition.c          | 15 ++---
 src/backend/partitioning/partdesc.c           | 30 +++++----
 src/include/catalog/pg_inherits.h             |  4 +-
 src/include/partitioning/partdesc.h           | 10 ++-
 .../detach-partition-concurrently-4.out       |  1 +
 8 files changed, 88 insertions(+), 62 deletions(-)

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index bb8b2249b1..bb6b1686ee 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -52,13 +52,19 @@ typedef struct SeenRelsEntry
  * then no locks are acquired, but caller must beware of race conditions
  * against possible DROPs of child relations.
  *
- * include_detached says to include all partitions, even if they're marked
- * detached.  Passing it as false means they might or might not be included,
- * depending on the visibility of the pg_inherits row for the active snapshot.
+ * If a partition's pg_inherits row is marked "detach pending",
+ * *detached_exist (if not null) is set true, otherwise it is set false.
+ *
+ * If omit_detached is true and there is an active snapshot (not the same as
+ * the catalog snapshot used to scan pg_inherits!) and a pg_inherits tuple
+ * marked "detach pending" is visible to that snapshot, then that partition is
+ * omitted from the output list.  This makes partitions invisible depending on
+ * whether the transaction that marked those partitions as detached appears
+ * committed to the active snapshot.
  */
 List *
-find_inheritance_children(Oid parentrelId, bool include_detached,
-						  LOCKMODE lockmode)
+find_inheritance_children(Oid parentrelId, bool omit_detached,
+						  LOCKMODE lockmode, bool *detached_exist)
 {
 	List	   *list = NIL;
 	Relation	relation;
@@ -78,6 +84,9 @@ find_inheritance_children(Oid parentrelId, bool include_detached,
 	if (!has_subclass(parentrelId))
 		return NIL;
 
+	if (detached_exist)
+		*detached_exist = false;
+
 	/*
 	 * Scan pg_inherits and build a working array of subclass OIDs.
 	 */
@@ -99,29 +108,34 @@ find_inheritance_children(Oid parentrelId, bool include_detached,
 	{
 		/*
 		 * Cope with partitions concurrently being detached.  When we see a
-		 * partition marked "detach pending", we only include it in the set of
-		 * visible partitions if caller requested all detached partitions, or
-		 * if its pg_inherits tuple's xmin is still visible to the active
-		 * snapshot.
+		 * partition marked "detach pending", we omit it from the returned
+		 * descriptor if caller requested that and the tuple's xmin does not
+		 * appear in progress to the active snapshot.  (If there's no active
+		 * snapshot set, that means we're not running a user query, so it's OK
+		 * to always include detached partitions in that case; if the xmin is
+		 * still running to the active snapshot, then the partition has not
+		 * been detached yet and so we include it.)
 		 *
-		 * The reason for this check is that we want to avoid seeing the
+		 * The reason for this hack is that we want to avoid seeing the
 		 * partition as alive in RI queries during REPEATABLE READ or
-		 * SERIALIZABLE transactions.  (If there's no active snapshot set,
-		 * that means we're not running a user query, so it's OK to always
-		 * include detached partitions in that case.)
+		 * SERIALIZABLE transactions.
 		 */
-		if (((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhdetachpending &&
-			!include_detached &&
-			ActiveSnapshotSet())
+		if (((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhdetachpending)
 		{
-			TransactionId xmin;
-			Snapshot snap;
+			if (detached_exist)
+				*detached_exist = true;
 
-			xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
-			snap = GetActiveSnapshot();
+			if (omit_detached && ActiveSnapshotSet())
+			{
+				TransactionId xmin;
+				Snapshot	snap;
 
-			if (!XidInMVCCSnapshot(xmin, snap))
-				continue;
+				xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
+				snap = GetActiveSnapshot();
+
+				if (!XidInMVCCSnapshot(xmin, snap))
+					continue;
+			}
 		}
 
 		inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
@@ -235,8 +249,8 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
 		ListCell   *lc;
 
 		/* Get the direct children of this rel */
-		currentchildren = find_inheritance_children(currentrel, false,
-													lockmode);
+		currentchildren = find_inheritance_children(currentrel, true,
+													lockmode, NULL);
 
 		/*
 		 * Add to the queue only those children not already seen. This avoids
@@ -524,7 +538,7 @@ DeleteInheritsTuple(Oid inhrelid, Oid inhparent, bool expect_detach_pending,
 		parent = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhparent;
 		if (!OidIsValid(inhparent) || parent == inhparent)
 		{
-			bool	detach_pending;
+			bool		detach_pending;
 
 			detach_pending =
 				((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhdetachpending;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 096a6f2891..31da913d2b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3507,7 +3507,7 @@ renameatt_internal(Oid myrelid,
 		 * expected_parents will only be 0 if we are not already recursing.
 		 */
 		if (expected_parents == 0 &&
-			find_inheritance_children(myrelid, false, NoLock) != NIL)
+			find_inheritance_children(myrelid, true, NoLock, NULL) != NIL)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 					 errmsg("inherited column \"%s\" must be renamed in child tables too",
@@ -3706,7 +3706,7 @@ rename_constraint_internal(Oid myrelid,
 		else
 		{
 			if (expected_parents == 0 &&
-				find_inheritance_children(myrelid, false, NoLock) != NIL)
+				find_inheritance_children(myrelid, true, NoLock, NULL) != NIL)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 						 errmsg("inherited constraint \"%s\" must be renamed in child tables too",
@@ -6580,7 +6580,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 */
 	if (colDef->identity &&
 		recurse &&
-		find_inheritance_children(myrelid, false, NoLock) != NIL)
+		find_inheritance_children(myrelid, true, NoLock, NULL) != NIL)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 				 errmsg("cannot recursively add identity column to table that has child tables")));
@@ -6826,7 +6826,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 * use find_all_inheritors to do it in one pass.
 	 */
 	children =
-		find_inheritance_children(RelationGetRelid(rel), false, lockmode);
+		find_inheritance_children(RelationGetRelid(rel), true, lockmode, NULL);
 
 	/*
 	 * If we are told not to recurse, there had better not be any child
@@ -7689,7 +7689,7 @@ ATPrepDropExpression(Relation rel, AlterTableCmd *cmd, bool recurse, bool recurs
 	 * resulting state can be properly dumped and restored.
 	 */
 	if (!recurse &&
-		find_inheritance_children(RelationGetRelid(rel), false, lockmode))
+		find_inheritance_children(RelationGetRelid(rel), true, lockmode, NULL))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("ALTER TABLE / DROP EXPRESSION must be applied to child tables too")));
@@ -8297,7 +8297,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 	 * use find_all_inheritors to do it in one pass.
 	 */
 	children =
-		find_inheritance_children(RelationGetRelid(rel), false, lockmode);
+		find_inheritance_children(RelationGetRelid(rel), true, lockmode, NULL);
 
 	if (children)
 	{
@@ -8785,7 +8785,7 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 * use find_all_inheritors to do it in one pass.
 	 */
 	children =
-		find_inheritance_children(RelationGetRelid(rel), false, lockmode);
+		find_inheritance_children(RelationGetRelid(rel), true, lockmode, NULL);
 
 	/*
 	 * Check if ONLY was specified with ALTER TABLE.  If so, allow the
@@ -11318,8 +11318,8 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 	 * use find_all_inheritors to do it in one pass.
 	 */
 	if (!is_no_inherit_constraint)
-		children =
-			find_inheritance_children(RelationGetRelid(rel), false, lockmode);
+		children = find_inheritance_children(RelationGetRelid(rel), true,
+											 lockmode, NULL);
 	else
 		children = NIL;
 
@@ -11703,8 +11703,8 @@ ATPrepAlterColumnType(List **wqueue,
 		}
 	}
 	else if (!recursing &&
-			 find_inheritance_children(RelationGetRelid(rel), false,
-									   NoLock) != NIL)
+			 find_inheritance_children(RelationGetRelid(rel), true,
+									   NoLock, NULL) != NIL)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 				 errmsg("type of inherited column \"%s\" must be changed in child tables too",
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3421014e47..2dce62999d 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -1141,8 +1141,8 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 			ListCell   *l;
 			List	   *idxs = NIL;
 
-			idxs = find_inheritance_children(indexOid, false,
-											 ShareRowExclusiveLock);
+			idxs = find_inheritance_children(indexOid, true,
+											 ShareRowExclusiveLock, NULL);
 			foreach(l, idxs)
 				childTbls = lappend_oid(childTbls,
 										IndexGetRelation(lfirst_oid(l),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 99780ebb96..bbfc9ef3fa 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -991,14 +991,11 @@ ExecInitPartitionDispatchInfo(EState *estate,
 
 	/*
 	 * For data modification, it is better that executor does not include
-	 * partitions being detached, except in snapshot-isolation mode.  This
-	 * means that a read-committed transaction immediately gets a "no
-	 * partition for tuple" error when a tuple is inserted into a partition
-	 * that's being detached concurrently, but a transaction in repeatable-
-	 * read mode can still use the partition.  Note that because partition
-	 * detach uses ShareLock on the partition (which conflicts with DML),
-	 * we're certain that the detach won't be able to complete until any
-	 * inserting transaction is done.
+	 * partitions being detached, except when running in snapshot-isolation
+	 * mode.  This means that a read-committed transaction immediately gets a
+	 * "no partition for tuple" error when a tuple is inserted into a
+	 * partition that's being detached concurrently, but a transaction in
+	 * repeatable-read mode can still use such a partition.
 	 */
 	if (estate->es_partition_directory == NULL)
 		estate->es_partition_directory =
@@ -1571,7 +1568,7 @@ ExecCreatePartitionPruneState(PlanState *planstate,
 	ListCell   *lc;
 	int			i;
 
-	/* Executor must always include detached partitions */
+	/* For data reading, executor always includes detached partitions */
 	if (estate->es_partition_directory == NULL)
 		estate->es_partition_directory =
 			CreatePartitionDirectory(estate->es_query_cxt, true);
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 58570fecfd..7d019a72be 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -47,7 +47,8 @@ typedef struct PartitionDirectoryEntry
 	PartitionDesc pd;
 } PartitionDirectoryEntry;
 
-static void RelationBuildPartitionDesc(Relation rel, bool include_detached);
+static PartitionDesc RelationBuildPartitionDesc(Relation rel,
+												bool include_detached);
 
 
 /*
@@ -64,14 +65,13 @@ static void RelationBuildPartitionDesc(Relation rel, bool include_detached);
 PartitionDesc
 RelationGetPartitionDesc(Relation rel, bool include_detached)
 {
-	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		return NULL;
+	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
 
-	if (unlikely(rel->rd_partdesc == NULL ||
-				 rel->rd_partdesc->includes_detached != include_detached))
-		RelationBuildPartitionDesc(rel, include_detached);
+	if (likely(rel->rd_partdesc &&
+			   (!rel->rd_partdesc->detached_exist || include_detached)))
+		return rel->rd_partdesc;
 
-	return rel->rd_partdesc;
+	return RelationBuildPartitionDesc(rel, include_detached);
 }
 
 /*
@@ -89,7 +89,7 @@ RelationGetPartitionDesc(Relation rel, bool include_detached)
  * that some of our callees allocate memory on their own which would be leaked
  * permanently.
  */
-static void
+static PartitionDesc
 RelationBuildPartitionDesc(Relation rel, bool include_detached)
 {
 	PartitionDesc partdesc;
@@ -98,6 +98,7 @@ RelationBuildPartitionDesc(Relation rel, bool include_detached)
 	PartitionBoundSpec **boundspecs = NULL;
 	Oid		   *oids = NULL;
 	bool	   *is_leaf = NULL;
+	bool		detached_exist;
 	ListCell   *cell;
 	int			i,
 				nparts;
@@ -112,8 +113,8 @@ RelationBuildPartitionDesc(Relation rel, bool include_detached)
 	 * concurrently, whatever this function returns will be accurate as of
 	 * some well-defined point in time.
 	 */
-	inhoids = find_inheritance_children(RelationGetRelid(rel), include_detached,
-										NoLock);
+	inhoids = find_inheritance_children(RelationGetRelid(rel), !include_detached,
+										NoLock, &detached_exist);
 	nparts = list_length(inhoids);
 
 	/* Allocate working arrays for OIDs, leaf flags, and boundspecs. */
@@ -234,6 +235,7 @@ RelationBuildPartitionDesc(Relation rel, bool include_detached)
 	partdesc = (PartitionDescData *)
 		MemoryContextAllocZero(new_pdcxt, sizeof(PartitionDescData));
 	partdesc->nparts = nparts;
+	partdesc->detached_exist = detached_exist;
 	/* If there are no partitions, the rest of the partdesc can stay zero */
 	if (nparts > 0)
 	{
@@ -241,7 +243,6 @@ RelationBuildPartitionDesc(Relation rel, bool include_detached)
 		partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
 		partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
 		partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool));
-		partdesc->includes_detached = include_detached;
 
 		/*
 		 * Assign OIDs from the original array into mapped indexes of the
@@ -276,7 +277,12 @@ RelationBuildPartitionDesc(Relation rel, bool include_detached)
 	if (rel->rd_pdcxt != NULL)
 		MemoryContextSetParent(rel->rd_pdcxt, new_pdcxt);
 	rel->rd_pdcxt = new_pdcxt;
-	rel->rd_partdesc = partdesc;
+
+	/* Store it into relcache, but only if no detached partitions exist */
+	if (!detached_exist)
+		rel->rd_partdesc = partdesc;
+
+	return partdesc;
 }
 
 /*
diff --git a/src/include/catalog/pg_inherits.h b/src/include/catalog/pg_inherits.h
index 6d07e1b302..4d28ede5a6 100644
--- a/src/include/catalog/pg_inherits.h
+++ b/src/include/catalog/pg_inherits.h
@@ -50,8 +50,8 @@ DECLARE_INDEX(pg_inherits_parent_index, 2187, on pg_inherits using btree(inhpare
 #define InheritsParentIndexId  2187
 
 
-extern List *find_inheritance_children(Oid parentrelId, bool include_detached,
-									   LOCKMODE lockmode);
+extern List *find_inheritance_children(Oid parentrelId, bool omit_detached,
+									   LOCKMODE lockmode, bool *detached_exist);
 extern List *find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
 								 List **parents);
 extern bool has_subclass(Oid relationId);
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index 7f03ff4271..41485c5b10 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -17,11 +17,19 @@
 
 /*
  * Information about partitions of a partitioned table.
+ *
+ * For partitioned tables where detached partitions exist, we only cache the
+ * descriptor that includes all partitions, including detached; when we're
+ * requested a descriptor without the detached partitions, we create one
+ * afresh each time.  (The reason for this is that the set of detached
+ * partitions that are visible to each caller depends on the snapshot it has,
+ * so it's pretty much impossible to evict a descriptor from cache at the
+ * right time.)
  */
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	bool		includes_detached;	/* Does it include detached partitions */
+	bool		detached_exist; /* Are there any detached partitions? */
 	Oid		   *oids;			/* Array of 'nparts' elements containing
 								 * partition OIDs in order of the their bounds */
 	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
diff --git a/src/test/isolation/expected/detach-partition-concurrently-4.out b/src/test/isolation/expected/detach-partition-concurrently-4.out
index 90a75cb077..2167675374 100644
--- a/src/test/isolation/expected/detach-partition-concurrently-4.out
+++ b/src/test/isolation/expected/detach-partition-concurrently-4.out
@@ -324,6 +324,7 @@ a
 1              
 2              
 step s1insert: insert into d4_fk values (1);
+ERROR:  insert or update on table "d4_fk" violates foreign key constraint "d4_fk_a_fkey"
 step s1c: commit;
 
 starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s s1insert s1c
-- 
2.20.1

Reply via email to