Hello

Here's my take on this feature, which owes a lot to David Rowley's version.

Firstly, I took Robert's advice and removed the CONCURRENTLY keyword from
the syntax.  We just do it that way always.  When there's a default
partition, only that partition is locked with AccessExclusiveLock (AEL);
everything else is locked with ShareUpdateExclusiveLock only.

I added some isolation tests for it -- they all pass for me.

There are two main ideas supporting this patch:

1. The partition descriptor cache module (partcache.c) now contains a
   long-lived hash table that lists all the current partition descriptors;
   when an invalidation message is received for a relation, we unlink the
   partdesc from the hash table *but do not free it*.  A new hash-table-linked
   partdesc is built the next time one is requested, so several copies might
   exist in memory for one partitioned table.

2. Snapshots have their own cache (hash table) of partition descriptors.
   If a partdesc is requested and the snapshot has already obtained that
   partdesc, the original one is returned -- we don't request a new one from
   partcache.

Then there are a few other implementation details worth mentioning:

3. Parallel query: when a worker starts on a snapshot that has a partition
   descriptor cache, we need to transmit those partdescs from the leader via
   shmem, but we cannot send the full struct, so we just send the OID list
   of partitions, then rebuild the descriptor in the worker.  Side effect:
   if a partition is detached right between the leader taking the partdesc
   and the worker starting, the partition no longer has its relpartbound
   set, so it's not possible to reconstruct the partdesc.  In this case, we
   raise an error.  Hopefully this should be rare.

4. If a partitioned table is dropped, but was listed in a snapshot's
   partdesc cache, and then parallel query starts, the worker will try to
   restore the partdesc for that table, but there are no catalog rows for
   it.  The implementation choice here is to ignore the table and move on.
   I would like to just remove the partdesc from the snapshot, but that
   would require a relcache inval callback, and
   a) it'd kill us to scan all snapshots for every relation drop;
   b) it doesn't work anyway, because we don't have any way to distinguish
      invals arriving because of DROP from invals arriving because of
      anything else, say ANALYZE.

5. Snapshots are copied a lot.  Copies share the same hash table as the
   "original", because surely all copies should see the same partition
   descriptor.  This leads to the pinning/unpinning business you see for
   the structs in snapmgr.c.

Some known defects:

6. This still leaks memory.  Not as terribly as my earlier prototypes, but
   clearly it's something that I need to address.

7. I've considered the idea of tracking snapshot-partdescs in resowner.c to
   prevent future memory leak mistakes.  Not done yet.  Closely related to
   item 6.

8. Header changes may need some cleanup yet -- e.g. I'm not sure snapmgr.h
   compiles standalone.

9. David Rowley recently pointed out that we can modify CREATE TABLE ..
   PARTITION OF to likewise not obtain AEL anymore.  Apparently it just
   requires removal of three lines in MergeAttributes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
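
For readers who want to see ideas 1 and 2 in isolation, here is a minimal
standalone sketch.  It is a toy model, not code from the patch: every name in
it (ToyPartDesc, ToySnapshot, toy_invalidate and so on) is hypothetical, plain
arrays stand in for the dynahash tables, and malloc stands in for memory
contexts.  It only illustrates that an invalidation unlinks the global entry
without freeing it, and that a snapshot keeps returning the descriptor it
obtained first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_RELS 8

/* Hypothetical stand-in for PartitionDesc: only the partition count. */
typedef struct ToyPartDesc
{
	int			nparts;
} ToyPartDesc;

/* Hypothetical per-snapshot cache, indexed directly by a small relid. */
typedef struct ToySnapshot
{
	ToyPartDesc *partdescs[TOY_MAX_RELS];
} ToySnapshot;

/* Stand-in for the catalogs: current number of partitions per relation. */
static int	toy_catalog_nparts[TOY_MAX_RELS];

/* Stand-in for the global partcache hash table. */
static ToyPartDesc *toy_global_cache[TOY_MAX_RELS];

/* Idea 1, lookup side: return the linked entry, building it if missing. */
static ToyPartDesc *
toy_lookup_partdesc_cache(int relid)
{
	if (toy_global_cache[relid] == NULL)
	{
		ToyPartDesc *pd = malloc(sizeof(ToyPartDesc));

		if (pd == NULL)
			abort();
		pd->nparts = toy_catalog_nparts[relid];
		toy_global_cache[relid] = pd;
	}
	return toy_global_cache[relid];
}

/*
 * Idea 1, invalidation side: unlink the entry but do not free it, since a
 * snapshot may still point at it.  This toy simply leaks the old struct;
 * the patch instead reparents its memory context to TopTransactionContext.
 */
static void
toy_invalidate(int relid)
{
	toy_global_cache[relid] = NULL;
}

/* Idea 2: a snapshot always hands back the descriptor it saw first. */
static ToyPartDesc *
toy_snapshot_get_partdesc(ToySnapshot *snap, int relid)
{
	if (snap->partdescs[relid] == NULL)
		snap->partdescs[relid] = toy_lookup_partdesc_cache(relid);
	return snap->partdescs[relid];
}

/* Walk through one attach happening between two snapshots; 0 on success. */
int
toy_demo(void)
{
	ToySnapshot old_snap = {{NULL}};
	ToySnapshot new_snap = {{NULL}};
	ToyPartDesc *first;

	toy_catalog_nparts[1] = 2;
	first = toy_snapshot_get_partdesc(&old_snap, 1);

	toy_catalog_nparts[1] = 3;	/* pretend a partition was attached */
	toy_invalidate(1);

	if (toy_snapshot_get_partdesc(&old_snap, 1) != first)
		return 1;				/* old snapshot must keep its descriptor */
	if (first->nparts != 2)
		return 2;				/* and that descriptor must be unchanged */
	if (toy_snapshot_get_partdesc(&new_snap, 1)->nparts != 3)
		return 3;				/* a new snapshot sees the rebuilt one */
	return 0;
}
```

Calling toy_demo() exercises the whole scenario: the older snapshot keeps
seeing two partitions after the invalidation, while a snapshot taken later
gets a freshly built descriptor with three.  The sketch leaves out the
pinning of shared hash tables between snapshot copies (item 5) and the
serialization to parallel workers (item 3).
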
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 3c9c03c997..79571e3a38 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -3614,7 +3614,7 @@ StorePartitionBound(Relation rel, Relation parent, PartitionBoundSpec *bound) * relcache entry for that partition every time a partition is added or * removed. */ - defaultPartOid = get_default_oid_from_partdesc(RelationGetPartitionDesc(parent)); + defaultPartOid = get_default_oid_from_partdesc(lookup_partdesc_cache(parent)); if (OidIsValid(defaultPartOid)) CacheInvalidateRelcacheByRelid(defaultPartOid); diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c index f4057a9f15..0b7dd2c612 100644 --- a/src/backend/catalog/pg_constraint.c +++ b/src/backend/catalog/pg_constraint.c @@ -33,6 +33,7 @@ #include "utils/builtins.h" #include "utils/fmgroids.h" #include "utils/lsyscache.h" +#include "utils/partcache.h" #include "utils/rel.h" #include "utils/syscache.h" #include "utils/tqual.h" @@ -753,7 +754,7 @@ clone_fk_constraints(Relation pg_constraint, Relation parentRel, if (partRel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE && subclone != NIL) { - PartitionDesc partdesc = RelationGetPartitionDesc(partRel); + PartitionDesc partdesc = lookup_partdesc_cache(partRel); int i; for (i = 0; i < partdesc->nparts; i++) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 906d711378..c027567c95 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -876,7 +876,7 @@ DefineIndex(Oid relationId, */ if (!stmt->relation || stmt->relation->inh) { - PartitionDesc partdesc = RelationGetPartitionDesc(rel); + PartitionDesc partdesc = lookup_partdesc_cache(rel); int nparts = partdesc->nparts; Oid *part_oids = palloc(sizeof(Oid) * nparts); bool invalidate_parent = false; diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 153aec263e..b3ec820d6e 100644 --- 
a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -830,7 +830,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId, * lock the partition so as to avoid a deadlock. */ defaultPartOid = - get_default_oid_from_partdesc(RelationGetPartitionDesc(parent)); + get_default_oid_from_partdesc(lookup_partdesc_cache(parent)); if (OidIsValid(defaultPartOid)) defaultRel = heap_open(defaultPartOid, AccessExclusiveLock); @@ -3614,9 +3614,15 @@ AlterTableGetLockLevel(List *cmds) cmd_lockmode = AlterTableGetRelOptionsLockLevel((List *) cmd->def); break; + /* + * Attaching and detaching partitions can be done + * concurrently. The default partition (if there's one) will + * have to be locked with AccessExclusive, but that's done + * elsewhere. + */ case AT_AttachPartition: case AT_DetachPartition: - cmd_lockmode = AccessExclusiveLock; + cmd_lockmode = ShareUpdateExclusiveLock; break; default: /* oops */ @@ -5903,7 +5909,7 @@ ATPrepDropNotNull(Relation rel, bool recurse, bool recursing) */ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { - PartitionDesc partdesc = RelationGetPartitionDesc(rel); + PartitionDesc partdesc = lookup_partdesc_cache(rel); Assert(partdesc != NULL); if (partdesc->nparts > 0 && !recurse && !recursing) @@ -6048,7 +6054,7 @@ ATPrepSetNotNull(Relation rel, bool recurse, bool recursing) */ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { - PartitionDesc partdesc = RelationGetPartitionDesc(rel); + PartitionDesc partdesc = lookup_partdesc_cache(rel); if (partdesc && partdesc->nparts > 0 && !recurse && !recursing) ereport(ERROR, @@ -7749,7 +7755,7 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel, { PartitionDesc partdesc; - partdesc = RelationGetPartitionDesc(rel); + partdesc = lookup_partdesc_cache(rel); for (i = 0; i < partdesc->nparts; i++) { @@ -14023,7 +14029,7 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel, } else if (scanrel->rd_rel->relkind == 
RELKIND_PARTITIONED_TABLE) { - PartitionDesc partdesc = RelationGetPartitionDesc(scanrel); + PartitionDesc partdesc = lookup_partdesc_cache(scanrel); int i; for (i = 0; i < partdesc->nparts; i++) @@ -14083,10 +14089,11 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd) /* * We must lock the default partition if one exists, because attaching a - * new partition will change its partition constraint. + * new partition will change its partition constraint. We must use + * AccessExclusiveLock here, to avoid routing any tuples to it that would + * belong in the newly attached partition. */ - defaultPartOid = - get_default_oid_from_partdesc(RelationGetPartitionDesc(rel)); + defaultPartOid = get_default_oid_from_partdesc(lookup_partdesc_cache(rel)); if (OidIsValid(defaultPartOid)) LockRelationOid(defaultPartOid, AccessExclusiveLock); @@ -14673,11 +14680,12 @@ ATExecDetachPartition(Relation rel, RangeVar *name) ListCell *cell; /* - * We must lock the default partition, because detaching this partition - * will change its partition constraint. + * We must lock the default partition if one exists, because detaching + * this partition will change its partition constraint. We must use + * AccessExclusiveLock here, to prevent concurrent routing of tuples using + * the obsolete partition constraint. 
*/ - defaultPartOid = - get_default_oid_from_partdesc(RelationGetPartitionDesc(rel)); + defaultPartOid = get_default_oid_from_partdesc(lookup_partdesc_cache(rel)); if (OidIsValid(defaultPartOid)) LockRelationOid(defaultPartOid, AccessExclusiveLock); @@ -14911,7 +14919,7 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name) RelationGetRelationName(partIdx)))); /* Make sure it indexes a partition of the other index's table */ - partDesc = RelationGetPartitionDesc(parentTbl); + partDesc = lookup_partdesc_cache(parentTbl); found = false; for (i = 0; i < partDesc->nparts; i++) { @@ -15046,6 +15054,7 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl) int tuples = 0; HeapTuple inhTup; bool updated = false; + PartitionDesc partdesc; Assert(partedIdx->rd_rel->relkind == RELKIND_PARTITIONED_INDEX); @@ -15085,7 +15094,8 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl) * If we found as many inherited indexes as the partitioned table has * partitions, we're good; update pg_index to set indisvalid. 
*/ - if (tuples == RelationGetPartitionDesc(partedTbl)->nparts) + partdesc = lookup_partdesc_cache(partedTbl); + if (tuples == partdesc->nparts) { Relation idxRel; HeapTuple newtup; diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index 240e85e391..676bb2851f 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -55,6 +55,7 @@ #include "utils/inval.h" #include "utils/lsyscache.h" #include "utils/memutils.h" +#include "utils/partcache.h" #include "utils/rel.h" #include "utils/snapmgr.h" #include "utils/syscache.h" @@ -1089,7 +1090,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString, */ if (partition_recurse) { - PartitionDesc partdesc = RelationGetPartitionDesc(rel); + PartitionDesc partdesc = lookup_partdesc_cache(rel); List *idxs = NIL; List *childTbls = NIL; ListCell *l; @@ -1857,7 +1858,7 @@ EnableDisableTrigger(Relation rel, const char *tgname, if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE && (TRIGGER_FOR_ROW(oldtrig->tgtype))) { - PartitionDesc partdesc = RelationGetPartitionDesc(rel); + PartitionDesc partdesc = lookup_partdesc_cache(rel); int i; for (i = 0; i < partdesc->nparts; i++) diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index 0bcb2377c3..98524ac093 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -30,6 +30,7 @@ #include "utils/rel.h" #include "utils/rls.h" #include "utils/ruleutils.h" +#include "utils/snapmgr.h" /*----------------------- @@ -950,7 +951,6 @@ get_partition_dispatch_recurse(Relation rel, Relation parent, List **pds, List **leaf_part_oids) { TupleDesc tupdesc = RelationGetDescr(rel); - PartitionDesc partdesc = RelationGetPartitionDesc(rel); PartitionKey partkey = RelationGetPartitionKey(rel); PartitionDispatch pd; int i; @@ -963,7 +963,7 @@ get_partition_dispatch_recurse(Relation rel, Relation parent, pd->reldesc = rel; pd->key = partkey; pd->keystate = NIL; - 
pd->partdesc = partdesc; + pd->partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), rel); if (parent != NULL) { /* @@ -1004,10 +1004,10 @@ get_partition_dispatch_recurse(Relation rel, Relation parent, * corresponding sub-partition; otherwise, we've identified the correct * partition. */ - pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int)); - for (i = 0; i < partdesc->nparts; i++) + pd->indexes = (int *) palloc(pd->partdesc->nparts * sizeof(int)); + for (i = 0; i < pd->partdesc->nparts; i++) { - Oid partrelid = partdesc->oids[i]; + Oid partrelid = pd->partdesc->oids[i]; if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE) { @@ -1515,7 +1515,8 @@ ExecCreatePartitionPruneState(PlanState *planstate, */ partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex); partkey = RelationGetPartitionKey(partrel); - partdesc = RelationGetPartitionDesc(partrel); + partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), + partrel); n_steps = list_length(pinfo->pruning_steps); diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c index d5720518a8..8602cb676a 100644 --- a/src/backend/optimizer/prep/prepunion.c +++ b/src/backend/optimizer/prep/prepunion.c @@ -49,8 +49,10 @@ #include "parser/parse_coerce.h" #include "parser/parsetree.h" #include "utils/lsyscache.h" +#include "utils/partcache.h" #include "utils/rel.h" #include "utils/selfuncs.h" +#include "utils/snapmgr.h" #include "utils/syscache.h" @@ -1580,13 +1582,11 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) oldrelation = heap_open(parentOID, NoLock); /* Scan the inheritance set and expand it */ - if (RelationGetPartitionDesc(oldrelation) != NULL) + if (rte->relkind == RELKIND_PARTITIONED_TABLE) { - Assert(rte->relkind == RELKIND_PARTITIONED_TABLE); - /* - * If this table has partitions, recursively expand them in the order - * in which they appear in the PartitionDesc. 
While at it, also + * If this is a partitioned table, recursively expand the partitions in the + * order in which they appear in the PartitionDesc. While at it, also * extract the partition key columns of all the partitioned tables. */ expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, @@ -1670,11 +1670,12 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte, int i; RangeTblEntry *childrte; Index childRTindex; - PartitionDesc partdesc = RelationGetPartitionDesc(parentrel); + PartitionDesc partdesc; check_stack_depth(); /* A partitioned table should always have a partition descriptor. */ + partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), parentrel); Assert(partdesc); Assert(parentrte->inh); diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index 46de00460d..2bf5dc8775 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -1905,7 +1905,7 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); - partdesc = RelationGetPartitionDesc(relation); + partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), relation); partkey = RelationGetPartitionKey(relation); rel->part_scheme = find_partition_scheme(root, relation); Assert(partdesc != NULL && rel->part_scheme != NULL); diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c index c94f73aadc..725bbe1977 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -308,7 +308,7 @@ check_new_partition_bound(char *relname, Relation parent, PartitionBoundSpec *spec) { PartitionKey key = RelationGetPartitionKey(parent); - PartitionDesc partdesc = RelationGetPartitionDesc(parent); + PartitionDesc partdesc = lookup_partdesc_cache(parent); PartitionBoundInfo boundinfo = partdesc->boundinfo; ParseState *pstate = make_parsestate(NULL); int with = -1; @@ -1415,13 
+1415,15 @@ get_qual_for_list(Relation parent, PartitionBoundSpec *spec) /* * For default list partition, collect datums for all the partitions. The * default partition constraint should check that the partition key is - * equal to none of those. + * equal to none of those. Using the cached version of the PartitionDesc + * is fine for default partitions since an AEL lock must be obtained to + * add partitions to a table which has a default partition. */ if (spec->is_default) { int i; int ndatums = 0; - PartitionDesc pdesc = RelationGetPartitionDesc(parent); + PartitionDesc pdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), parent); PartitionBoundInfo boundinfo = pdesc->boundinfo; if (boundinfo) @@ -1621,7 +1623,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec, if (spec->is_default) { List *or_expr_args = NIL; - PartitionDesc pdesc = RelationGetPartitionDesc(parent); + PartitionDesc pdesc = lookup_partdesc_cache(parent); Oid *inhoids = pdesc->oids; int nparts = pdesc->nparts, i; diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 908f62d37e..a4be3dea6a 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -1761,6 +1761,9 @@ GetSnapshotData(Snapshot snapshot) snapshot->regd_count = 0; snapshot->copied = false; + /* this is set later, if appropriate */ + snapshot->partdescs = NULL; + if (old_snapshot_threshold < 0) { /* diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c index 5757301d05..be39db7a22 100644 --- a/src/backend/utils/cache/partcache.c +++ b/src/backend/utils/cache/partcache.c @@ -29,20 +29,50 @@ #include "optimizer/planner.h" #include "partitioning/partbounds.h" #include "utils/builtins.h" +#include "utils/catcache.h" #include "utils/datum.h" +#include "utils/inval.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/partcache.h" #include "utils/rel.h" #include "utils/syscache.h" +/* + * We keep a 
partition descriptor cache (partcache), separate from relcache, + * for partitioned tables. Each entry points to a PartitionDesc struct. + * + * On relcache invalidations, the then-current partdesc for the involved + * relation is removed from the hash table, but not freed; instead, its + * containing memory context is reparented to TopTransactionContext. This + * way, it continues to be available for the current transaction, but newly + * planned queries will obtain a fresh descriptor. + * + * XXX doing it this way amounts to a transaction-long memory leak. + * This is not terrible, because these objects are typically a few hundred + * to a few thousand bytes at the most, so we can live with that. + * + * This is what we would like to do instead: + * Partcache entries are reference-counted and live beyond relcache + * invalidations, to protect callers that need to work with consistent + * partition descriptor entries. On relcache invalidations, the "current" + * partdesc for the involved relation is removed from the hash table, but not + * freed; a pointer to the existing entry is kept in a separate hash table, + * from where it is removed later when the refcount drops to zero. 
+ */ +/* The partition descriptor hashtable, searched by lookup_partdesc_cache */ +static HTAB *PartCacheHash = NULL; + +static void PartCacheRelCallback(Datum arg, Oid relid); +static PartitionDesc BuildPartitionDesc(Relation rel, List *oids); static List *generate_partition_qual(Relation rel); static int32 qsort_partition_hbound_cmp(const void *a, const void *b); static int32 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg); static int32 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg); +static void create_partcache_hashtab(int nelems); /* @@ -59,6 +89,8 @@ static int32 qsort_partition_rbound_cmp(const void *a, const void *b, * context the current context except in very brief code sections, out of fear * that some of our callees allocate memory on their own which would be leaked * permanently. + * + * XXX this function should be in relcache.c. */ void RelationBuildPartitionKey(Relation relation) @@ -251,14 +283,105 @@ RelationBuildPartitionKey(Relation relation) } /* - * RelationBuildPartitionDesc - * Form rel's partition descriptor + * lookup_partdesc_cache * - * Not flushed from the cache by RelationClearRelation() unless changed because - * of addition or removal of partition. + * Fetch the partition descriptor cache entry for the specified relation. */ -void -RelationBuildPartitionDesc(Relation rel) +PartitionDesc +lookup_partdesc_cache(Relation partedrel) +{ + Oid relid = RelationGetRelid(partedrel); + PartdescCacheEntry *entry; + bool found; + + if (PartCacheHash == NULL) + { + /* First time through: set up hash table */ + create_partcache_hashtab(64); + /* Also set up callback for SI invalidations */ + CacheRegisterRelcacheCallback(PartCacheRelCallback, (Datum) 0); + } + + /* If the hashtable has an entry, we're done. 
*/ + entry = (PartdescCacheEntry *) hash_search(PartCacheHash, + (void *) &relid, + HASH_ENTER, &found); + if (found) + return entry->partdesc; + + /* None found; gotta create one */ + entry->partdesc = BuildPartitionDesc(partedrel, NIL); + + return entry->partdesc; +} + +/* + * PartCacheRelCallback + * Relcache inval callback function + * + * When a relcach inval is received, we must not make the partcache entry + * disappear -- it may still be visible to some snapshot. Keep it around + * instead, but unlink it from the global hash table. We do reparent its + * memory context to be a child of the current transaction context, so that it + * goes away as soon as the current transaction finishes. No snapshot can + * live longer than that. + * + * On reset, we delete the entire hash table. + */ +static void +PartCacheRelCallback(Datum arg, Oid relid) +{ + if (!OidIsValid(relid)) + { + HASH_SEQ_STATUS status; + PartdescCacheEntry *entry; + int nelems = 0; + + /* + * In case of a full relcache reset, we must reparent all entries to + * the current transaction context and flush the entire hash table. + */ + hash_seq_init(&status, PartCacheHash); + while ((entry = (PartdescCacheEntry *) hash_seq_search(&status)) != NULL) + { + MemoryContextSetParent(entry->partdesc->memcxt, + TopTransactionContext); + nelems++; + } + hash_destroy(PartCacheHash); + create_partcache_hashtab(nelems); + } + else + { + PartdescCacheEntry *entry; + bool found; + + /* + * For a single-relation inval message, search the hash table + * for that entry directly. + */ + entry = hash_search(PartCacheHash, (void *) &relid, + HASH_REMOVE, &found); + if (found) + MemoryContextSetParent(entry->partdesc->memcxt, + TopTransactionContext); + } +} + +/* + * BuildPartitionDesc + * Build and return the PartitionDesc for 'rel'. 
+ * + * partrelids can be passed as a list of partitions that will be included in + * the descriptor; this is useful when the list of partitions is fixed in + * advance, for example when a parallel worker restores state from the parallel + * leader. If partrelids is NIL, then pg_inherits is scanned (with catalog + * snapshot) to determine the list of partitions. + * + * This function is supposed not to leak any memory. + */ +static PartitionDesc +BuildPartitionDesc(Relation rel, List *partrelids) { List *inhoids, *partoids; @@ -270,22 +393,45 @@ RelationBuildPartitionDesc(Relation rel) PartitionKey key = RelationGetPartitionKey(rel); PartitionDesc result; MemoryContext oldcxt; - int ndatums = 0; int default_index = -1; - - /* Hash partitioning specific */ PartitionHashBound **hbounds = NULL; - - /* List partitioning specific */ PartitionListValue **all_values = NULL; int null_index = -1; - - /* Range partitioning specific */ PartitionRangeBound **rbounds = NULL; + MemoryContext memcxt; - /* Get partition oids from pg_inherits */ - inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock); + /* + * Each partition descriptor has and is contained in its own memory + * context. We start by creating a child of the current memory context, + * and then set it as child of CacheMemoryContext if everything goes well, + * making the partdesc permanent. + */ + memcxt = AllocSetContextCreate(CurrentMemoryContext, + "partition descriptor", + ALLOCSET_SMALL_SIZES); + MemoryContextCopyAndSetIdentifier(memcxt, + RelationGetRelationName(rel)); + result = MemoryContextAllocZero(memcxt, sizeof(PartitionDescData)); + result->memcxt = memcxt; + + /* + * To guarantee no memory leaks in this function, we create a temporary + * memory context into which all our transient allocations go. This also + * enables us to run without pfree'ing anything; simply deleting this + * context at the end is enough. 
+ */ + memcxt = AllocSetContextCreate(CurrentMemoryContext, + "partdesc temp", + ALLOCSET_SMALL_SIZES); + oldcxt = MemoryContextSwitchTo(memcxt); + + /* + * If caller passed an OID array, use that as the partition list; + * otherwise obtain a fresh one from pg_inherits. + */ + inhoids = partrelids != NIL ? partrelids : + find_inheritance_children(RelationGetRelid(rel), NoLock); /* Collect bound spec nodes in a list */ i = 0; @@ -302,11 +448,25 @@ RelationBuildPartitionDesc(Relation rel) if (!HeapTupleIsValid(tuple)) elog(ERROR, "cache lookup failed for relation %u", inhrelid); + /* + * If this partition doesn't have relpartbound set, it must have been + * recently detached. We can't cope with that; producing a partition + * descriptor without it might cause a crash if used with a plan + * containing partition prune info. Raise an error in this case. + * + * This should only happen when an ALTER TABLE DETACH PARTITION occurs + * between the leader process of a parallel query serializes the partition + * descriptor and the workers restore it, so it should be pretty + * uncommon anyway. + */ datum = SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound, &isnull); if (isnull) - elog(ERROR, "null relpartbound for relation %u", inhrelid); + ereport(ERROR, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("relation %u is no longer a partition", inhrelid))); + boundspec = (Node *) stringToNode(TextDatumGetCString(datum)); /* @@ -435,8 +595,8 @@ RelationBuildPartitionDesc(Relation rel) * Collect all list values in one array. Alongside the value, we * also save the index of partition the value comes from. 
*/ - all_values = (PartitionListValue **) palloc(ndatums * - sizeof(PartitionListValue *)); + all_values = (PartitionListValue **) + palloc(ndatums * sizeof(PartitionListValue *)); i = 0; foreach(cell, non_null_values) { @@ -565,15 +725,12 @@ RelationBuildPartitionDesc(Relation rel) (int) key->strategy); } - /* Now build the actual relcache partition descriptor */ - rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext, - "partition descriptor", - ALLOCSET_DEFAULT_SIZES); - MemoryContextCopyAndSetIdentifier(rel->rd_pdcxt, RelationGetRelationName(rel)); + /* + * Everything allocated from here on is part of the PartitionDesc, so use + * the descriptor's own memory context. + */ + MemoryContextSwitchTo(result->memcxt); - oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt); - - result = (PartitionDescData *) palloc0(sizeof(PartitionDescData)); result->nparts = nparts; if (nparts > 0) { @@ -591,8 +748,8 @@ RelationBuildPartitionDesc(Relation rel) boundinfo->null_index = -1; boundinfo->datums = (Datum **) palloc0(ndatums * sizeof(Datum *)); - /* Initialize mapping array with invalid values */ - mapping = (int *) palloc(sizeof(int) * nparts); + /* Initialize temporary mapping array with invalid values */ + mapping = (int *) MemoryContextAlloc(memcxt, sizeof(int) * nparts); for (i = 0; i < nparts; i++) mapping[i] = -1; @@ -628,9 +785,7 @@ RelationBuildPartitionDesc(Relation rel) } mapping[hbounds[i]->index] = i; - pfree(hbounds[i]); } - pfree(hbounds); break; } @@ -771,11 +926,96 @@ RelationBuildPartitionDesc(Relation rel) */ for (i = 0; i < nparts; i++) result->oids[mapping[i]] = oids[i]; - pfree(mapping); } + MemoryContextDelete(memcxt); MemoryContextSwitchTo(oldcxt); - rel->rd_partdesc = result; + + /* Make the new entry permanent */ + MemoryContextSetParent(result->memcxt, CacheMemoryContext); + + return result; +} + +/* + * EstimatePartCacheEntrySpace + * Returns the size needed to store the given partition descriptor. 
+ * + * We are exporting only required fields from the partition descriptor. + */ +Size +EstimatePartCacheEntrySpace(PartdescCacheEntry *pce) +{ + return sizeof(Oid) + sizeof(int) + sizeof(Oid) * pce->partdesc->nparts; +} + +/* + * SerializePartCacheEntry + * Dumps the serialized partition descriptor cache entry onto the + * memory location at start_address. The amount of memory used is + * returned. + */ +Size +SerializePartCacheEntry(PartdescCacheEntry *pce, char *start_address) +{ + Size offset; + + /* copy all required fields */ + memcpy(start_address, &pce->relid, sizeof(Oid)); + offset = sizeof(Oid); + memcpy(start_address + offset, &pce->partdesc->nparts, sizeof(int)); + offset += sizeof(int); + memcpy(start_address + offset, pce->partdesc->oids, + pce->partdesc->nparts * sizeof(Oid)); + offset += pce->partdesc->nparts * sizeof(Oid); + + return offset; +} + +/* + * RestorePartitionDescriptor + * Restore a serialized partition descriptor from the specified address. + * The amount of memory read is returned. + */ +Size +RestorePartdescCacheEntry(PartdescCacheEntry *pce, Oid relid, + char *start_address) +{ + Size offset = 0; + Relation rel; + int nparts; + Oid *oids; + List *oidlist = NIL; + + pce->relid = relid; + + memcpy(&nparts, start_address, sizeof(int)); + offset += sizeof(int); + + oids = palloc(nparts * sizeof(Oid)); + memcpy(oids, start_address + offset, nparts * sizeof(Oid)); + offset += nparts * sizeof(Oid); + for (int i = 0; i < nparts; i++) + oidlist = lappend_oid(oidlist, oids[i]); + + /* + * If the snapshot still contains in its cache a descriptor for a relation + * that was dropped, we cannot open it here anymore; ignore it. We cannot + * rely on the invalidation occurring at the time of relation drop, + * because we want to preserve entries across invalidations arriving for + * reasons other than drop. 
+ */ + rel = try_relation_open(pce->relid, AccessShareLock); + if (rel) + { + pce->partdesc = BuildPartitionDesc(rel, oidlist); + relation_close(rel, NoLock); + } + + pfree(oids); + list_free(oidlist); + + return offset; } /* @@ -962,3 +1202,19 @@ qsort_partition_rbound_cmp(const void *a, const void *b, void *arg) key->partcollation, b1->datums, b1->kind, b1->lower, b2); } + +/* + * Auxiliary function to create the hash table containing the partition + * descriptor cache. + */ +static void +create_partcache_hashtab(int nelems) +{ + HASHCTL ctl; + + MemSet(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(PartdescCacheEntry); + PartCacheHash = hash_create("Partition Descriptors", nelems, &ctl, + HASH_ELEM | HASH_BLOBS); +} diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index fd3d010b77..1a16a2b7a2 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -288,8 +288,6 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid, StrategyNumber numSupport); static void RelationCacheInitFileRemoveInDir(const char *tblspcpath); static void unlink_initfile(const char *initfilename, int elevel); -static bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1, - PartitionDesc partdesc2); /* @@ -1003,60 +1001,6 @@ equalRSDesc(RowSecurityDesc *rsdesc1, RowSecurityDesc *rsdesc2) } /* - * equalPartitionDescs - * Compare two partition descriptors for logical equality - */ -static bool -equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1, - PartitionDesc partdesc2) -{ - int i; - - if (partdesc1 != NULL) - { - if (partdesc2 == NULL) - return false; - if (partdesc1->nparts != partdesc2->nparts) - return false; - - Assert(key != NULL || partdesc1->nparts == 0); - - /* - * Same oids? If the partitioning structure did not change, that is, - * no partitions were added or removed to the relation, the oids array - * should still match element-by-element. 
-		 */
-		for (i = 0; i < partdesc1->nparts; i++)
-		{
-			if (partdesc1->oids[i] != partdesc2->oids[i])
-				return false;
-		}
-
-		/*
-		 * Now compare partition bound collections. The logic to iterate over
-		 * the collections is private to partition.c.
-		 */
-		if (partdesc1->boundinfo != NULL)
-		{
-			if (partdesc2->boundinfo == NULL)
-				return false;
-
-			if (!partition_bounds_equal(key->partnatts, key->parttyplen,
-										key->parttypbyval,
-										partdesc1->boundinfo,
-										partdesc2->boundinfo))
-				return false;
-		}
-		else if (partdesc2->boundinfo != NULL)
-			return false;
-	}
-	else if (partdesc2 != NULL)
-		return false;
-
-	return true;
-}
-
-/*
  *		RelationBuildDesc
  *
  *		Build a relation descriptor. The caller must hold at least
@@ -1184,18 +1128,13 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	relation->rd_fkeylist = NIL;
 	relation->rd_fkeyvalid = false;
 
-	/* if a partitioned table, initialize key and partition descriptor info */
+	/* if a partitioned table, initialize key info */
 	if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		RelationBuildPartitionKey(relation);
-		RelationBuildPartitionDesc(relation);
-	}
 	else
 	{
 		relation->rd_partkeycxt = NULL;
 		relation->rd_partkey = NULL;
-		relation->rd_partdesc = NULL;
-		relation->rd_pdcxt = NULL;
 	}
 
 	/*
@@ -2284,8 +2223,6 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 		MemoryContextDelete(relation->rd_rsdesc->rscxt);
 	if (relation->rd_partkeycxt)
 		MemoryContextDelete(relation->rd_partkeycxt);
-	if (relation->rd_pdcxt)
-		MemoryContextDelete(relation->rd_pdcxt);
 	if (relation->rd_partcheck)
 		pfree(relation->rd_partcheck);
 	if (relation->rd_fdwroutine)
@@ -2440,7 +2377,6 @@ RelationClearRelation(Relation relation, bool rebuild)
 		bool		keep_rules;
 		bool		keep_policies;
 		bool		keep_partkey;
-		bool		keep_partdesc;
 
 		/* Build temporary entry, but don't link it into hashtable */
 		newrel = RelationBuildDesc(save_relid, false);
@@ -2473,9 +2409,6 @@ RelationClearRelation(Relation relation, bool rebuild)
 		keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
 		/* partkey is immutable once set up, so we can always keep it */
 		keep_partkey = (relation->rd_partkey != NULL);
-		keep_partdesc = equalPartitionDescs(relation->rd_partkey,
-											relation->rd_partdesc,
-											newrel->rd_partdesc);
 
 		/*
 		 * Perform swapping of the relcache entry contents. Within this
@@ -2536,11 +2469,6 @@ RelationClearRelation(Relation relation, bool rebuild)
 			SWAPFIELD(PartitionKey, rd_partkey);
 			SWAPFIELD(MemoryContext, rd_partkeycxt);
 		}
-		if (keep_partdesc)
-		{
-			SWAPFIELD(PartitionDesc, rd_partdesc);
-			SWAPFIELD(MemoryContext, rd_pdcxt);
-		}
 
 #undef SWAPFIELD
 
@@ -3776,7 +3704,7 @@ RelationCacheInitializePhase3(void)
 			}
 
 			/*
-			 * Reload the partition key and descriptor for a partitioned table.
+			 * Reload the partition key for a partitioned table.
 			 */
 			if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
 				relation->rd_partkey == NULL)
@@ -3787,15 +3715,6 @@ RelationCacheInitializePhase3(void)
 				restart = true;
 			}
 
-			if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
-				relation->rd_partdesc == NULL)
-			{
-				RelationBuildPartitionDesc(relation);
-				Assert(relation->rd_partdesc != NULL);
-
-				restart = true;
-			}
-
 			/* Release hold on the relation */
 			RelationDecrementReferenceCount(relation);
 
@@ -5652,8 +5571,6 @@ load_relcache_init_file(bool shared)
 		rel->rd_rsdesc = NULL;
 		rel->rd_partkeycxt = NULL;
 		rel->rd_partkey = NULL;
-		rel->rd_pdcxt = NULL;
-		rel->rd_partdesc = NULL;
 		rel->rd_partcheck = NIL;
 		rel->rd_indexprs = NIL;
 		rel->rd_indpred = NIL;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29..72a82f05cc 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
+#include "utils/partcache.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
@@ -128,6 +129,11 @@ typedef struct OldSnapshotControlData
 
 static volatile OldSnapshotControlData *oldSnapshotControl;
 
+typedef struct SnapshotPartitionDescriptors
+{
+	HTAB	   *hashtab;
+	int			refcount;
+} SnapshotPartitionDescriptors;
 
 /*
  * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
@@ -228,6 +234,11 @@ static Snapshot CopySnapshot(Snapshot snapshot);
 static void FreeSnapshot(Snapshot snapshot);
 static void SnapshotResetXmin(void);
 
+static void init_partition_descriptors(Snapshot snapshot);
+static void create_partdesc_hashtab(Snapshot snapshot);
+static void pin_partition_descriptors(Snapshot snapshot);
+static void unpin_partition_descriptors(Snapshot snapshot);
+
 /*
  * Snapshot fields to be serialized.
  *
@@ -245,6 +256,7 @@ typedef struct SerializedSnapshotData
 	CommandId	curcid;
 	TimestampTz whenTaken;
 	XLogRecPtr	lsn;
+	int			npartdescs;	/* XXX move? */
 } SerializedSnapshotData;
 
 Size
@@ -355,6 +367,9 @@ GetTransactionSnapshot(void)
 		else
 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
+		init_partition_descriptors(CurrentSnapshot);
+		pin_partition_descriptors(CurrentSnapshot);
+
 		FirstSnapshotSet = true;
 		return CurrentSnapshot;
 	}
@@ -367,6 +382,8 @@ GetTransactionSnapshot(void)
 
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
+	init_partition_descriptors(CurrentSnapshot);
+
 	return CurrentSnapshot;
 }
 
@@ -396,7 +413,16 @@ GetLatestSnapshot(void)
 	if (!FirstSnapshotSet)
 		return GetTransactionSnapshot();
 
+	/*
+	 * If we have a partition descriptor cache from a previous iteration,
+	 * clean it up
+	 */
+	if (SecondarySnapshot)
+		unpin_partition_descriptors(SecondarySnapshot);
+
 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
+	init_partition_descriptors(SecondarySnapshot);
+	pin_partition_descriptors(SecondarySnapshot);
 
 	return SecondarySnapshot;
 }
@@ -678,6 +704,13 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->active_count = 0;
 	newsnap->copied = true;
 
+	/*
+	 * All copies of a snapshot share the same partition descriptor cache; we
+	 * must not free it until all references to it are gone. Caller must see
+	 * to it that the descriptor is pinned!
+	 */
+	newsnap->partdescs = snapshot->partdescs;
+
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
 	{
@@ -718,6 +751,9 @@ FreeSnapshot(Snapshot snapshot)
 	Assert(snapshot->active_count == 0);
 	Assert(snapshot->copied);
 
+	if (snapshot->partdescs)
+		unpin_partition_descriptors(snapshot);
+
 	pfree(snapshot);
 }
 
@@ -747,6 +783,8 @@ PushActiveSnapshot(Snapshot snap)
 	else
 		newactive->as_snap = snap;
 
+	pin_partition_descriptors(newactive->as_snap);
+
 	newactive->as_next = ActiveSnapshot;
 	newactive->as_level = GetCurrentTransactionNestLevel();
 
@@ -891,6 +929,8 @@ RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner)
 	if (snap->regd_count == 1)
 		pairingheap_add(&RegisteredSnapshots, &snap->ph_node);
 
+	pin_partition_descriptors(snap);
+
 	return snap;
 }
 
@@ -1075,6 +1115,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	}
 	FirstXactSnapshot = NULL;
 
+	/* FIXME what do we need for partdescs here?? */
+
 	/*
 	 * If we exported any snapshots, clean them up.
 	 */
@@ -2056,6 +2098,20 @@ EstimateSnapshotSpace(Snapshot snap)
 		size = add_size(size,
 						mul_size(snap->subxcnt, sizeof(TransactionId)));
 
+	if (snap->partdescs->hashtab)
+	{
+		HASH_SEQ_STATUS status;
+		void	   *entry;
+
+		size = add_size(size, sizeof(int));
+
+		hash_seq_init(&status, snap->partdescs->hashtab);
+		while ((entry = hash_seq_search(&status)) != NULL)
+		{
+			size = add_size(size, EstimatePartCacheEntrySpace(entry));
+		}
+	}
+
 	return size;
 }
 
@@ -2068,9 +2124,21 @@ void
 SerializeSnapshot(Snapshot snapshot, char *start_address)
 {
 	SerializedSnapshotData serialized_snapshot;
+	int			numpartdescs = 0;
 
 	Assert(snapshot->subxcnt >= 0);
 
+	/* Count entries in local partition descriptor cache, if there's one */
+	if (snapshot->partdescs->hashtab)
+	{
+		HASH_SEQ_STATUS status;
+		void	   *entry;
+
+		hash_seq_init(&status, snapshot->partdescs->hashtab);
+		while ((entry = hash_seq_search(&status)) != NULL)
+			numpartdescs++;
+	}
+
 	/* Copy all required fields */
 	serialized_snapshot.xmin = snapshot->xmin;
 	serialized_snapshot.xmax = snapshot->xmax;
@@ -2081,6 +2149,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 	serialized_snapshot.curcid = snapshot->curcid;
 	serialized_snapshot.whenTaken = snapshot->whenTaken;
 	serialized_snapshot.lsn = snapshot->lsn;
+	serialized_snapshot.npartdescs = numpartdescs;
 
 	/*
 	 * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2114,6 +2183,25 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 		memcpy((TransactionId *) (start_address + subxipoff),
 			   snapshot->subxip, snapshot->subxcnt * sizeof(TransactionId));
 	}
+
+	/* Serialize each cached partition descriptor. */
+	if (numpartdescs > 0)
+	{
+		HASH_SEQ_STATUS status;
+		Size		partdescoff;
+		void	   *entry;
+
+		partdescoff = sizeof(SerializedSnapshotData) +
+			snapshot->xcnt * sizeof(TransactionId) +
+			serialized_snapshot.subxcnt * sizeof(TransactionId);
+
+		hash_seq_init(&status, snapshot->partdescs->hashtab);
+		while ((entry = hash_seq_search(&status)) != NULL)
+		{
+			partdescoff +=
+				SerializePartCacheEntry(entry, start_address + partdescoff);
+		}
+	}
 }
 
 /*
@@ -2156,6 +2244,8 @@ RestoreSnapshot(char *start_address)
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
 
+	snapshot->partdescs = NULL;
+
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
 	{
@@ -2173,6 +2263,31 @@ RestoreSnapshot(char *start_address)
 			   serialized_snapshot.subxcnt * sizeof(TransactionId));
 	}
 
+	if (serialized_snapshot.npartdescs > 0)
+	{
+		char	   *address = start_address + sizeof(SerializedSnapshotData) +
+			serialized_snapshot.xcnt * sizeof(TransactionId) +
+			serialized_snapshot.subxcnt * sizeof(TransactionId);
+
+		init_partition_descriptors(snapshot);
+		create_partdesc_hashtab(snapshot);
+		pin_partition_descriptors(snapshot);	/* XXX is this needed? */
+
+		for (int i = 0; i < serialized_snapshot.npartdescs; i++)
+		{
+			Oid			relid;
+			PartdescCacheEntry *entry;
+
+			memcpy(&relid, address, sizeof(Oid));
+			address += sizeof(Oid);
+
+			entry = hash_search(snapshot->partdescs->hashtab, &relid,
+								HASH_ENTER, NULL);
+
+			address += RestorePartdescCacheEntry(entry, relid, address);
+		}
+	}
+
 	/* Set the copied flag so that the caller will set refcounts correctly. */
 	snapshot->regd_count = 0;
 	snapshot->active_count = 0;
@@ -2192,3 +2307,103 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *master_pgproc)
 {
 	SetTransactionSnapshot(snapshot, NULL, InvalidPid, master_pgproc);
 }
+
+/*---------------------------------------------------------------------
+ * Partition descriptor cache support
+ *---------------------------------------------------------------------
+ */
+
+/*
+ * SnapshotGetPartitionDesc
+ *		Return a partition descriptor valid for the given snapshot.
+ *
+ * If the partition descriptor has already been cached for this snapshot,
+ * return that; otherwise, partcache.c does the actual work.
+ */
+PartitionDesc
+SnapshotGetPartitionDesc(Snapshot snapshot, Relation rel)
+{
+	PartdescCacheEntry *entry;
+	Oid			relid = RelationGetRelid(rel);
+	bool		found;
+
+	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+	/* Initialize hash table on first call */
+	if (snapshot->partdescs->hashtab == NULL)
+		create_partdesc_hashtab(snapshot);
+
+	/* Search hash table, initializing new entry if not found */
+	entry = hash_search(snapshot->partdescs->hashtab, &relid,
+						HASH_ENTER, &found);
+	if (!found)
+		entry->partdesc = lookup_partdesc_cache(rel);
+
+	return entry->partdesc;
+}
+
+/*
+ * Initialize the partition descriptor struct for this snapshot.
+ */
+static void
+init_partition_descriptors(Snapshot snapshot)
+{
+	SnapshotPartitionDescriptors *descs;
+
+	descs = MemoryContextAlloc(TopTransactionContext,
+							   sizeof(SnapshotPartitionDescriptors));
+	descs->hashtab = NULL;
+	descs->refcount = 0;
+
+	snapshot->partdescs = descs;
+}
+
+/*
+ * Create the hashtable for the partition descriptor cache of this snapshot.
+ *
+ * We do this separately from initializing, to delay until the hashtable is
+ * really needed. Many snapshots will never access a partitioned table.
+ */
+static void
+create_partdesc_hashtab(Snapshot snapshot)
+{
+	HASHCTL		ctl;
+
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(PartdescCacheEntry);
+	ctl.hcxt = TopTransactionContext;
+	snapshot->partdescs->hashtab =
+		hash_create("Snapshot Partdescs", 10, &ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+}
+
+/*
+ * Increment pin count for this snapshot's partition descriptor.
+ */
+static void
+pin_partition_descriptors(Snapshot snapshot)
+{
+	if (snapshot->partdescs)
+		snapshot->partdescs->refcount++;
+}
+
+/*
+ * Decrement pin count for this snapshot's partition descriptor.
+ *
+ * If this was the last snapshot using this partition descriptor, free it.
+ */
+static void
+unpin_partition_descriptors(Snapshot snapshot)
+{
+	/* Quick exit for snapshots without partition descriptors */
+	if (!snapshot->partdescs)
+		return;
+
+	if (--snapshot->partdescs->refcount <= 0)
+	{
+		hash_destroy(snapshot->partdescs->hashtab);
+		pfree(snapshot->partdescs);
+		snapshot->partdescs = NULL;
+	}
+}
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..74904d6285 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -19,15 +19,6 @@
 /* Seed for the extended hash function */
 #define HASH_PARTITION_SEED UINT64CONST(0x7A5B22367996DCFD)
 
-/*
- * Information about partitions of a partitioned table.
- */
-typedef struct PartitionDescData
-{
-	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
-	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
-} PartitionDescData;
 
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_partition_ancestors(Oid relid);
diff --git a/src/include/utils/partcache.h b/src/include/utils/partcache.h
index 873c60fafd..b48bcd33fe 100644
--- a/src/include/utils/partcache.h
+++ b/src/include/utils/partcache.h
@@ -46,11 +46,35 @@ typedef struct PartitionKeyData
 	Oid		   *parttypcoll;
 } PartitionKeyData;
 
+/*
+ * Information about partitions of a partitioned table.
+ */
+typedef struct PartitionDescData
+{
+	Oid			relid;			/* hash key -- must be first */
+	int			nparts;			/* Number of partitions */
+	Oid		   *oids;			/* OIDs of partitions */
+	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+	MemoryContext memcxt;		/* memory context containing this entry */
+} PartitionDescData;
+
+typedef struct PartdescCacheEntry
+{
+	Oid			relid;
+	PartitionDesc partdesc;
+} PartdescCacheEntry;
+
+extern PartitionDesc lookup_partdesc_cache(Relation partedrel);
+
 extern void RelationBuildPartitionKey(Relation relation);
-extern void RelationBuildPartitionDesc(Relation rel);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern Size EstimatePartCacheEntrySpace(PartdescCacheEntry *pce);
+extern Size SerializePartCacheEntry(PartdescCacheEntry *pce, char *start_address);
+extern Size RestorePartdescCacheEntry(PartdescCacheEntry *pce, Oid relid,
+						  char *start_address);
+
 /*
  * PartitionKey inquiry functions
  */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 84469f5715..97cfd0f4d0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -97,8 +97,6 @@ typedef struct RelationData
 
 	MemoryContext rd_partkeycxt;	/* private memory cxt for the below */
 	struct PartitionKeyData *rd_partkey;	/* partition key, or NULL */
-	MemoryContext rd_pdcxt;		/* private context for partdesc */
-	struct PartitionDescData *rd_partdesc;	/* partitions, or NULL */
 	List	   *rd_partcheck;	/* partition CHECK quals */
 
 	/* data managed by RelationGetIndexList: */
@@ -589,12 +587,6 @@ typedef struct ViewOptions
  */
 #define RelationGetPartitionKey(relation) ((relation)->rd_partkey)
 
-/*
- * RelationGetPartitionDesc
- *		Returns partition descriptor for a relation.
- */
-#define RelationGetPartitionDesc(relation) ((relation)->rd_partdesc)
-
 /* routines in utils/cache/relcache.c */
 extern void RelationIncrementReferenceCount(Relation rel);
 extern void RelationDecrementReferenceCount(Relation rel);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3040..443c4532c4 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -14,6 +14,7 @@
 #define SNAPMGR_H
 
 #include "fmgr.h"
+#include "partitioning/partdefs.h"
 #include "utils/relcache.h"
 #include "utils/resowner.h"
 #include "utils/snapshot.h"
@@ -83,6 +84,8 @@ extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
 extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
 
+extern PartitionDesc SnapshotGetPartitionDesc(Snapshot snapshot, Relation rel);
+
 extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index a8a5a8f4c0..d99bacd8b6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -34,6 +34,12 @@ typedef bool (*SnapshotSatisfiesFunc) (HeapTuple htup,
 										   Snapshot snapshot, Buffer buffer);
 
 /*
+ * Partition descriptors cached by the snapshot. Opaque to outside callers;
+ * use SnapshotGetPartitionDesc().
+ */
+struct SnapshotPartitionDescriptors;
+
+/*
  * Struct representing all kind of possible snapshots.
  *
  * There are several different kinds of snapshots:
@@ -103,6 +109,9 @@ typedef struct SnapshotData
 	 */
 	uint32		speculativeToken;
 
+	/* cached partitioned table descriptors */
+	struct SnapshotPartitionDescriptors *partdescs;
+
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
diff --git a/src/test/isolation/expected/attach-partition-1.out b/src/test/isolation/expected/attach-partition-1.out
new file mode 100644
index 0000000000..3a5a5b6422
--- /dev/null
+++ b/src/test/isolation/expected/attach-partition-1.out
@@ -0,0 +1,31 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1b s1s s2a s1s s3b s3s s1c s1s s3s s3c
+step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1s: SELECT * FROM listp;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1s: SELECT * FROM listp;
+a
+
+1
+step s3b: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3s: SELECT * FROM listp;
+a
+
+1
+2
+step s1c: COMMIT;
+step s1s: SELECT * FROM listp;
+a
+
+1
+2
+step s3s: SELECT * FROM listp;
+a
+
+1
+2
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/attach-partition-2.out b/src/test/isolation/expected/attach-partition-2.out
new file mode 100644
index 0000000000..c4090ceb0d
--- /dev/null
+++ b/src/test/isolation/expected/attach-partition-2.out
@@ -0,0 +1,238 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brc s1prep s1exec s2a s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brc s1prep s1exec s2a s1dummy s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy: SELECT 1;
+?column?
+
+1
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brc s1prep s1exec s2a s1dummy2 s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy2: SELECT 1 + 1;
+?column?
+
+2
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brc s1prep s1exec s2a s1ins s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1ins: INSERT INTO listp VALUES (1);
+step s1exec: EXECUTE f;
+a
+
+1
+1
+2
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1dummy s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy: SELECT 1;
+?column?
+
+1
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1dummy2 s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy2: SELECT 1 + 1;
+?column?
+
+2
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1ins s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1ins: INSERT INTO listp VALUES (1);
+step s1exec: EXECUTE f;
+a
+
+1
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+1
+
+starting permutation: s1prep s1exec s2a s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1prep s1exec s2a s1dummy s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy: SELECT 1;
+?column?
+
+1
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1prep s1exec s2a s1dummy2 s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy2: SELECT 1 + 1;
+?column?
+
+2
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1prep s1exec s2a s1ins s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1ins: INSERT INTO listp VALUES (1);
+step s1exec: EXECUTE f;
+a
+
+1
+1
+2
diff --git a/src/test/isolation/expected/detach-partition-1.out b/src/test/isolation/expected/detach-partition-1.out
new file mode 100644
index 0000000000..b14d9f1018
--- /dev/null
+++ b/src/test/isolation/expected/detach-partition-1.out
@@ -0,0 +1,42 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brr s1s s2d s1s s2drop s1c s1s
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+
+starting permutation: s1brc s1s s2d s1s s2drop s1c s1s
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1s: SELECT * FROM d_listp;
+a
+
+1
diff --git a/src/test/isolation/expected/detach-partition-2.out b/src/test/isolation/expected/detach-partition-2.out
new file mode 100644
index 0000000000..8c1e828c5f
--- /dev/null
+++ b/src/test/isolation/expected/detach-partition-2.out
@@ -0,0 +1,37 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brr s1dec s1fetch s2d s1fetch s2drop s1c
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1dec: DECLARE f NO SCROLL CURSOR FOR SELECT * FROM d_listp;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+
+starting permutation: s1brc s1dec s1fetch s2d s1fetch s2drop s1c
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1dec: DECLARE f NO SCROLL CURSOR FOR SELECT * FROM d_listp;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
diff --git a/src/test/isolation/expected/detach-partition-3.out b/src/test/isolation/expected/detach-partition-3.out
new file mode 100644
index 0000000000..cb775b8f97
--- /dev/null
+++ b/src/test/isolation/expected/detach-partition-3.out
@@ -0,0 +1,45 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brr s1prep s1exec s2d s1exec s2drop s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM dp_listp;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2d: ALTER TABLE dp_listp DETACH PARTITION dp_listp2;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2drop: DROP TABLE dp_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1exec: EXECUTE f;
+a
+
+1
+
+starting permutation: s1brc s1prep s1exec s2d s1exec s2drop s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM dp_listp;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2d: ALTER TABLE dp_listp DETACH PARTITION dp_listp2;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2drop: DROP TABLE dp_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1exec: EXECUTE f;
+a
+
+1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index dd57a96e78..da72d31507 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -77,5 +77,10 @@ test: partition-key-update-1
 test: partition-key-update-2
 test: partition-key-update-3
 test: partition-key-update-4
+test: attach-partition-1
+test: attach-partition-2
+test: detach-partition-1
+test: detach-partition-2
+test: detach-partition-3
 test: plpgsql-toast
 test: truncate-conflict
diff --git a/src/test/isolation/specs/attach-partition-1.spec b/src/test/isolation/specs/attach-partition-1.spec
new file mode 100644
index 0000000000..4d8af76d92
--- /dev/null
+++ b/src/test/isolation/specs/attach-partition-1.spec
@@ -0,0 +1,30 @@
+# Test that attach partition concurrently makes the partition visible at the
+# correct time.
+
+setup
+{
+  CREATE TABLE listp (a int) PARTITION BY LIST(a);
+  CREATE TABLE listp1 PARTITION OF listp FOR VALUES IN (1);
+  CREATE TABLE listp2 (a int);
+  INSERT INTO listp1 VALUES (1);
+  INSERT INTO listp2 VALUES (2);
+}
+
+teardown { DROP TABLE listp; }
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1s" { SELECT * FROM listp; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2a" { ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2); }
+
+session "s3"
+step "s3b" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3s" { SELECT * FROM listp; }
+step "s3c" { COMMIT; }
+
+# listp2's row should not be visible to s1 until transaction commit.
+# session 3 should see listp2's row with both SELECTs it performs.
+permutation "s1b" "s1s" "s2a" "s1s" "s3b" "s3s" "s1c" "s1s" "s3s" "s3c"
diff --git a/src/test/isolation/specs/attach-partition-2.spec b/src/test/isolation/specs/attach-partition-2.spec
new file mode 100644
index 0000000000..c6a7de8801
--- /dev/null
+++ b/src/test/isolation/specs/attach-partition-2.spec
@@ -0,0 +1,42 @@
+setup
+{
+  CREATE TABLE listp (a int) PARTITION BY LIST(a);
+  CREATE TABLE listp1 PARTITION OF listp FOR VALUES IN (1);
+  CREATE TABLE listp2 (a int);
+  INSERT INTO listp1 VALUES (1);
+  INSERT INTO listp2 VALUES (2);
+}
+
+teardown { DROP TABLE listp; }
+
+session "s1"
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1prep" { PREPARE f AS SELECT * FROM listp ; }
+step "s1exec" { EXECUTE f; }
+step "s1ins" { INSERT INTO listp VALUES (1); }
+step "s1dummy" { SELECT 1; }
+step "s1dummy2" { SELECT 1 + 1; }
+step "s1c" { COMMIT; }
+teardown { DEALLOCATE f; }
+
+session "s2"
+step "s2a"{ ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2); }
+
+# read committed
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1exec" "s1c" "s1exec"
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1dummy" "s1exec" "s1c" "s1exec"
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1dummy2" "s1exec" "s1c" "s1exec"
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1ins" "s1exec" "s1c" "s1exec"
+
+# repeatable read
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1exec" "s1c" "s1exec"
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1dummy" "s1exec" "s1c" "s1exec"
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1dummy2" "s1exec" "s1c" "s1exec"
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1ins" "s1exec" "s1c" "s1exec"
+
+# no transaction
+permutation "s1prep" "s1exec" "s2a" "s1exec"
+permutation "s1prep" "s1exec" "s2a" "s1dummy" "s1exec"
+permutation "s1prep" "s1exec" "s2a" "s1dummy2" "s1exec"
+permutation "s1prep" "s1exec" "s2a" "s1ins" "s1exec"
diff --git a/src/test/isolation/specs/detach-partition-1.spec b/src/test/isolation/specs/detach-partition-1.spec
new file mode 100644
index 0000000000..8f18853948
--- /dev/null
+++ b/src/test/isolation/specs/detach-partition-1.spec
@@ -0,0 +1,31 @@
+# Test that detach partition concurrently makes the partition invisible at the
+# correct time.
+
+setup
+{
+  CREATE TABLE d_listp (a int) PARTITION BY LIST(a);
+  CREATE TABLE d_listp1 PARTITION OF d_listp FOR VALUES IN (1);
+  CREATE TABLE d_listp2 PARTITION OF d_listp FOR VALUES IN (2);
+  INSERT INTO d_listp VALUES (1),(2);
+}
+
+teardown { DROP TABLE IF EXISTS d_listp, d_listp2; }
+
+session "s1"
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1s" { SELECT * FROM d_listp; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2d" { ALTER TABLE d_listp DETACH PARTITION d_listp2; }
+step "s2drop" { DROP TABLE d_listp2; }
+
+# In repeatable-read isolation level, d_listp2's row should always be visible to
+# s1 until transaction commit. Also, s2 cannot drop the detached partition
+# until s1 has closed its transaction.
+permutation "s1brr" "s1s" "s2d" "s1s" "s2drop" "s1c" "s1s"
+
+# In read-committed isolation level, the partition "disappears" immediately
+# from view. However, the DROP still has to wait for s1's commit.
+permutation "s1brc" "s1s" "s2d" "s1s" "s2drop" "s1c" "s1s"
diff --git a/src/test/isolation/specs/detach-partition-2.spec b/src/test/isolation/specs/detach-partition-2.spec
new file mode 100644
index 0000000000..24035276a8
--- /dev/null
+++ b/src/test/isolation/specs/detach-partition-2.spec
@@ -0,0 +1,32 @@
+# Test that detach partition concurrently makes the partition invisible at the
+# correct time.
+
+setup
+{
+  CREATE TABLE d_listp (a int) PARTITION BY LIST(a);
+  CREATE TABLE d_listp1 PARTITION OF d_listp FOR VALUES IN (1);
+  CREATE TABLE d_listp2 PARTITION OF d_listp FOR VALUES IN (2);
+  INSERT INTO d_listp VALUES (1),(2);
+}
+
+teardown { DROP TABLE IF EXISTS d_listp, d_listp2; }
+
+session "s1"
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1dec" { DECLARE f NO SCROLL CURSOR FOR SELECT * FROM d_listp; }
+step "s1fetch" { FETCH ALL FROM f; MOVE ABSOLUTE 0 f; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2d" { ALTER TABLE d_listp DETACH PARTITION d_listp2; }
+step "s2drop" { DROP TABLE d_listp2; }
+
+# In repeatable-read isolation level, d_listp2's row should always be visible to
+# s1 until transaction commit. Also, s2 cannot drop the detached partition
+# until s1 has closed its transaction.
+permutation "s1brr" "s1dec" "s1fetch" "s2d" "s1fetch" "s2drop" "s1c"
+
+# In read-committed isolation level, the partition "disappears" immediately
+# from view. However, the DROP still has to wait for s1's commit.
+permutation "s1brc" "s1dec" "s1fetch" "s2d" "s1fetch" "s2drop" "s1c"
diff --git a/src/test/isolation/specs/detach-partition-3.spec b/src/test/isolation/specs/detach-partition-3.spec
new file mode 100644
index 0000000000..5410f92d31
--- /dev/null
+++ b/src/test/isolation/specs/detach-partition-3.spec
@@ -0,0 +1,33 @@
+# Test that detach partition concurrently makes the partition invisible at the
+# correct time.
+
+setup
+{
+  CREATE TABLE dp_listp (a int) PARTITION BY LIST(a);
+  CREATE TABLE dp_listp1 PARTITION OF dp_listp FOR VALUES IN (1);
+  CREATE TABLE dp_listp2 PARTITION OF dp_listp FOR VALUES IN (2);
+  INSERT INTO dp_listp VALUES (1),(2);
+}
+
+teardown { DROP TABLE IF EXISTS dp_listp, dp_listp2; }
+
+session "s1"
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1prep" { PREPARE f AS SELECT * FROM dp_listp; }
+step "s1exec" { EXECUTE f; }
+step "s1c" { COMMIT; }
+teardown { DEALLOCATE f; }
+
+session "s2"
+step "s2d" { ALTER TABLE dp_listp DETACH PARTITION dp_listp2; }
+step "s2drop" { DROP TABLE dp_listp2; }
+
+# In repeatable-read isolation level, dp_listp2's row should always be visible to
+# s1 until transaction commit. Also, s2 cannot drop the detached partition
+# until s1 has closed its transaction.
+permutation "s1brr" "s1prep" "s1exec" "s2d" "s1exec" "s2drop" "s1c" "s1exec"
+
+# In read-committed isolation level, the partition "disappears" immediately
+# from view. However, the DROP still has to wait for s1's commit.
+permutation "s1brc" "s1prep" "s1exec" "s2d" "s1exec" "s2drop" "s1c" "s1exec"
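
For anyone following the pin/unpin business in snapmgr.c, here is a minimal standalone sketch of the same reference-counting idea. It is not part of the patch. Snap, DescCache, snap_init, snap_pin, snap_unpin, snap_copy and live_caches are all hypothetical names, and calloc/free stand in for allocation in TopTransactionContext. Two simplifications are assumed: the copy helper takes the pin itself (in the patch, CopySnapshot leaves pinning to the caller), and unpinning always clears the handle's pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for SnapshotPartitionDescriptors. */
typedef struct DescCache
{
	int		refcount;	/* number of snapshots pinning this cache */
	void   *hashtab;	/* created lazily; NULL until first lookup */
} DescCache;

/* Hypothetical stand-in for the partdescs field of SnapshotData. */
typedef struct Snap
{
	DescCache  *partdescs;
} Snap;

/* Number of caches currently allocated; lets callers observe frees. */
static int	live_caches = 0;

/* Give a freshly taken snapshot its own, still unpinned, cache. */
static void
snap_init(Snap *s)
{
	s->partdescs = calloc(1, sizeof(DescCache));
	if (s->partdescs == NULL)
		abort();
	live_caches++;
}

/* Increment the pin count, if this snapshot has a cache at all. */
static void
snap_pin(Snap *s)
{
	if (s->partdescs)
		s->partdescs->refcount++;
}

/* Drop one pin; the last unpin frees the shared cache. */
static void
snap_unpin(Snap *s)
{
	if (!s->partdescs)
		return;

	if (--s->partdescs->refcount <= 0)
	{
		free(s->partdescs->hashtab);
		free(s->partdescs);
		live_caches--;
	}
	/* Simplification: this handle never touches the cache again. */
	s->partdescs = NULL;
}

/* A copy shares the original's cache and takes its own pin on it. */
static Snap
snap_copy(const Snap *src)
{
	Snap		copy = *src;

	snap_pin(&copy);
	return copy;
}
```

The point of the sketch is that a copy never gets a private cache: both handles point at one DescCache, and it is released only when the last of them is unpinned, whichever that happens to be.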