On 2017/08/26 3:28, Robert Haas wrote: > On Mon, Aug 21, 2017 at 2:10 AM, Amit Langote > <langote_amit...@lab.ntt.co.jp> wrote: >> [ new patches ] > > I am failing to understand the point of separating PartitionDispatch > into PartitionDispatch and PartitionTableInfo. That seems like an > unnecessary multiplication of entities, as does the introduction of > PartitionKeyInfo. I also think that replacing reldesc with reloid is > not really an improvement; any places that gets the relid has to go > open the relation to get the reldesc, whereas without that it has a > direct pointer to the information it needs.
I am worried about the open relcache reference in PartitionDispatch when we start using it in the planner. Whereas there is a ExecEndModifyTable() as a suitable place to close that reference, there doesn't seem to exist one within the planner, but I guess we will have to figure something out. For time being, the second patch closes the same in expand_inherited_rtentry() right after picking up the OID using RelationGetRelid(pd->reldesc). > I suggest that this patch just focus on removing the following things > from PartitionDispatchData: keystate, tupslot, tupmap. Those things > are clearly executor-specific stuff that makes sense to move to a > different structure, what you're calling PartitionTupleRoutingInfo > (not sure that's the best name). The other stuff all seems fine. > You're going to have to open the relation anyway, so keeping the > reldesc around seems like an optimization, if anything. The > PartitionKey and PartitionDesc pointers may not really be needed -- > they're just pointers into reldesc -- but they're trivial to compute, > so it doesn't hurt anything to have them either as a > micro-optimization for performance or even just for readability. OK, done this way in the attached updated patch. Any suggestions about a better name for what the patch calls PartitionTupleRoutingInfo? > That just leaves indexes. In a world where keystate, tupslot, and > tupmap are removed from the PartitionDispatchData, you must need > indexes or there would be no point in constructing a > PartitionDispatchData object in the first place; any application that > needs neither indexes nor the executor-specific stuff could just use > the Relation directly. Agreed. > Regarding your XXX comments, note that if you've got a lock on a > relation, the pointers to the PartitionKey and PartitionDesc are > stable. The PartitionKey can't change once it's established, and the > PartitionDesc can't change while we've got a lock on the relation > unless we change it ourselves (and any places that do should have > CheckTableNotInUse checks). The keep_partkey and keep_partdesc > handling in relcache.c exists exactly so that we can guarantee that > the pointer won't go stale under us. Now, if we *don't* have a lock > on the relation, then those pointers can easily be invalidated -- so > you can't hang onto a PartitionDispatch for longer than you hang onto > the lock on the Relation. But that shouldn't be a problem. I think > you only need to hang onto PartitionDispatch pointers for the lifetime > of a single query. One can imagine optimizations where we try to > avoid rebuilding that for subsequent queries but I'm not sure there's > any demonstrated need for such a system at present. Here too. Attached are the updated patches. Thanks, Amit
From fb4bd4818c4faa08b3c4d37709f01dc55f256a46 Mon Sep 17 00:00:00 2001 From: amit <amitlangot...@gmail.com> Date: Mon, 24 Jul 2017 18:59:57 +0900 Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor Currently it and the structure it generates viz. PartitionDispatch objects are too coupled with the executor's tuple-routing code. In particular, it's pretty undesirable that it makes it the responsibility of the caller to release some resources, such as executor tuple table slots, tuple-conversion maps, etc. That makes it harder to use in places other than where it's currently being used. After this refactoring, ExecSetupPartitionTupleRouting() now needs to do some of the work that was previously done in RelationGetPartitionDispatchInfo(). --- src/backend/catalog/partition.c | 278 +++++++++++++++------------------ src/backend/commands/copy.c | 37 +++-- src/backend/executor/execMain.c | 124 +++++++++++++-- src/backend/executor/nodeModifyTable.c | 37 +++-- src/include/catalog/partition.h | 34 ++-- src/include/executor/executor.h | 4 +- src/include/nodes/execnodes.h | 40 ++++- 7 files changed, 326 insertions(+), 228 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 96a64ce6b2..25fc4583de 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -981,181 +981,147 @@ get_partition_qual_relid(Oid relid) } /* - * Append OIDs of rel's partitions to the list 'partoids' and for each OID, - * append pointer rel to the list 'parents'. - */ -#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \ - do\ - {\ - int i;\ - for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\ - {\ - (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\ - (parents) = lappend((parents), (rel));\ - }\ - } while(0) - -/* * RelationGetPartitionDispatchInfo - * Returns information necessary to route tuples down a partition tree + * Returns necessary information for each partition in the partition + * tree rooted at rel * - * The number of elements in the returned array (that is, the number of - * PartitionDispatch objects for the partitioned tables in the partition tree) - * is returned in *num_parted and a list of the OIDs of all the leaf - * partitions of rel is returned in *leaf_part_oids. + * A list of PartitionDispatch objects is returned, which contains one object + * for each partitioned table in the partition tree (with at least one member, + * that is, the one for the root partitioned table). Also, upon return, + * *leaf_part_oids will contain a list of the OIDs of all the leaf partitions + * in the partition tree. * - * All the relations in the partition tree (including 'rel') must have been - * locked (using at least the AccessShareLock) by the caller. + * We require that the caller has locked at least the partitioned tables in + * the partition tree (including 'rel') using at least the AccessShareLock, + * because we need to look at their relcache entries to examine its + * PartitionDesc. + * + * It's the responsibility of the caller to close the relation descriptor + * reference contained in each PartitionDispatch object. */ -PartitionDispatch * -RelationGetPartitionDispatchInfo(Relation rel, - int *num_parted, List **leaf_part_oids) +List * +RelationGetPartitionDispatchInfo(Relation rel, List **leaf_part_oids) { - PartitionDispatchData **pd; - List *all_parts = NIL, - *all_parents = NIL, - *parted_rels, - *parted_rel_parents; + List *result = NIL, + *all_parts, + *all_parents; ListCell *lc1, *lc2; int i, - k, offset; /* * We rely on the relcache to traverse the partition tree to build both - * the leaf partition OIDs list and the array of PartitionDispatch objects - * for the partitioned tables in the tree. That means every partitioned - * table in the tree must be locked, which is fine since we require the - * caller to lock all the partitions anyway. + * the leaf partition OIDs list and the list of PartitionDispatch objects + * for the partitioned tables. That means every partitioned table in the + * tree must be locked, which is fine since the callers must have done + * that already. * * For every partitioned table in the tree, starting with the root * partitioned table, add its relcache entry to parted_rels, while also * queuing its partitions (in the order in which they appear in the * partition descriptor) to be looked at later in the same loop. This is * a bit tricky but works because the foreach() macro doesn't fetch the - * next list element until the bottom of the loop. + * next list element until the bottom of the loop. Non-partitioned tables + * are simply added to the leaf partitions list. */ - *num_parted = 1; - parted_rels = list_make1(rel); - /* Root partitioned table has no parent, so NULL for parent */ - parted_rel_parents = list_make1(NULL); - APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents); + i = offset = 0; + *leaf_part_oids = NIL; + + /* Start with the root table. */ + all_parts = list_make1_oid(RelationGetRelid(rel)); + all_parents = list_make1_oid(InvalidOid); forboth(lc1, all_parts, lc2, all_parents) { - Oid partrelid = lfirst_oid(lc1); - Relation parent = lfirst(lc2); + Oid partrelid = lfirst_oid(lc1); + Oid parentrelid = lfirst_oid(lc2); if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE) { - /* - * Already locked by the caller. Note that it is the - * responsibility of the caller to close the below relcache entry, - * once done using the information being collected here (for - * example, in ExecEndModifyTable). - */ - Relation partrel = heap_open(partrelid, NoLock); + int j, + k; + Relation partrel; + PartitionDesc partdesc; + PartitionDispatch pd; + + if (partrelid != RelationGetRelid(rel)) + partrel = heap_open(partrelid, NoLock); + else + partrel = rel; - (*num_parted)++; - parted_rels = lappend(parted_rels, partrel); - parted_rel_parents = lappend(parted_rel_parents, parent); - APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents); - } - } + partdesc = RelationGetPartitionDesc(partrel); + + pd = (PartitionDispatchData *) + palloc0(sizeof(PartitionDispatchData)); + pd->reldesc = partrel; + pd->parentoid = parentrelid; - /* - * We want to create two arrays - one for leaf partitions and another for - * partitioned tables (including the root table and internal partitions). - * While we only create the latter here, leaf partition array of suitable - * objects (such as, ResultRelInfo) is created by the caller using the - * list of OIDs we return. Indexes into these arrays get assigned in a - * breadth-first manner, whereby partitions of any given level are placed - * consecutively in the respective arrays. - */ - pd = (PartitionDispatchData **) palloc(*num_parted * - sizeof(PartitionDispatchData *)); - *leaf_part_oids = NIL; - i = k = offset = 0; - forboth(lc1, parted_rels, lc2, parted_rel_parents) - { - Relation partrel = lfirst(lc1); - Relation parent = lfirst(lc2); - PartitionKey partkey = RelationGetPartitionKey(partrel); - TupleDesc tupdesc = RelationGetDescr(partrel); - PartitionDesc partdesc = RelationGetPartitionDesc(partrel); - int j, - m; - - pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData)); - pd[i]->reldesc = partrel; - pd[i]->key = partkey; - pd[i]->keystate = NIL; - pd[i]->partdesc = partdesc; - if (parent != NULL) - { /* - * For every partitioned table other than root, we must store a - * tuple table slot initialized with its tuple descriptor and a - * tuple conversion map to convert a tuple from its parent's - * rowtype to its own. That is to make sure that we are looking at - * the correct row using the correct tuple descriptor when - * computing its partition key for tuple routing. + * The values contained in the following array correspond to + * indexes of this table's partitions in the global sequence of + * all the partitions contained in the partition tree rooted at + * rel, traversed in a breadh-first manner. The values should be + * such that we will be able to distinguish the leaf partitions + * from the non-leaf partitions, because they are returned to + * to the caller in separate structures from where they will be + * accessed. The way that's done is described below: + * + * Leaf partition OIDs are put into the global leaf_part_oids list, + * and for each one, the value stored is its ordinal position in + * the list minus 1. + * + * PartitionDispatch objects corresponding to partitions that + * are partitioned tables are put into the global result list, + * and for each one, the value stored is its ordinal position in + * the list multiplied by -1. + * + * So, when examining the values in the indexes array, getting a + * value >= 0 means the corresponding partition is a leaf + * partition. Otherwise, it's a partitioned table. */ - pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc); - pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent), - tupdesc, - gettext_noop("could not convert row type")); - } - else - { - /* Not required for the root partitioned table */ - pd[i]->tupslot = NULL; - pd[i]->tupmap = NULL; - } - pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int)); - - /* - * Indexes corresponding to the internal partitions are multiplied by - * -1 to distinguish them from those of leaf partitions. Encountering - * an index >= 0 means we found a leaf partition, which is immediately - * returned as the partition we are looking for. A negative index - * means we found a partitioned table, whose PartitionDispatch object - * is located at the above index multiplied back by -1. Using the - * PartitionDispatch object, search is continued further down the - * partition tree. - */ - m = 0; - for (j = 0; j < partdesc->nparts; j++) - { - Oid partrelid = partdesc->oids[j]; + pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int)); - if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE) - { - *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid); - pd[i]->indexes[j] = k++; - } - else + k = 0; + for (j = 0; j < partdesc->nparts; j++) { + Oid partrelid = partdesc->oids[j]; + /* - * offset denotes the number of partitioned tables of upper - * levels including those of the current level. Any partition - * of this table must belong to the next level and hence will - * be placed after the last partitioned table of this level. + * Queue this partition so that it will be processed later + * by the outer loop. */ - pd[i]->indexes[j] = -(1 + offset + m); - m++; + all_parts = lappend_oid(all_parts, partrelid); + all_parents = lappend_oid(all_parents, + RelationGetRelid(partrel)); + + if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE) + { + *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid); + pd->indexes[j] = i++; + } + else + { + /* + * offset denotes the number of partitioned tables that + * we have already processed. k counts the number of + * partitions of this table that were found to be + * partitioned tables. + */ + pd->indexes[j] = -(1 + offset + k); + k++; + } } - } - i++; - /* - * This counts the number of partitioned tables at upper levels - * including those of the current level. - */ - offset += m; + offset += k; + + result = lappend(result, pd); + } } - return pd; + Assert(i == list_length(*leaf_part_oids)); + Assert((offset + 1) == list_length(result)); + + return result; } /* Module-local functions */ @@ -1860,7 +1826,7 @@ generate_partition_qual(Relation rel) * Construct values[] and isnull[] arrays for the partition key * of a tuple. * - * pd Partition dispatch object of the partitioned table + * ptrinfo PartitionTupleRoutingInfo object of the table * slot Heap tuple from which to extract partition key * estate executor state for evaluating any partition key * expressions (must be non-NULL) @@ -1872,29 +1838,30 @@ generate_partition_qual(Relation rel) * ---------------- */ void -FormPartitionKeyDatum(PartitionDispatch pd, +FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo, TupleTableSlot *slot, EState *estate, Datum *values, bool *isnull) { + PartitionKey key = RelationGetPartitionKey(ptrinfo->pd->reldesc); ListCell *partexpr_item; int i; - if (pd->key->partexprs != NIL && pd->keystate == NIL) + if (key->partexprs != NIL && ptrinfo->keystate == NIL) { /* Check caller has set up context correctly */ Assert(estate != NULL && GetPerTupleExprContext(estate)->ecxt_scantuple == slot); /* First time through, set up expression evaluation state */ - pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate); + ptrinfo->keystate = ExecPrepareExprList(key->partexprs, estate); } - partexpr_item = list_head(pd->keystate); - for (i = 0; i < pd->key->partnatts; i++) + partexpr_item = list_head(ptrinfo->keystate); + for (i = 0; i < key->partnatts; i++) { - AttrNumber keycol = pd->key->partattrs[i]; + AttrNumber keycol = key->partattrs[i]; Datum datum; bool isNull; @@ -1931,13 +1898,13 @@ FormPartitionKeyDatum(PartitionDispatch pd, * the latter case. */ int -get_partition_for_tuple(PartitionDispatch *pd, +get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos, TupleTableSlot *slot, EState *estate, - PartitionDispatchData **failed_at, + PartitionTupleRoutingInfo **failed_at, TupleTableSlot **failed_slot) { - PartitionDispatch parent; + PartitionTupleRoutingInfo *parent; Datum values[PARTITION_MAX_KEYS]; bool isnull[PARTITION_MAX_KEYS]; int cur_offset, @@ -1948,11 +1915,12 @@ get_partition_for_tuple(PartitionDispatch *pd, TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple; /* start with the root partitioned table */ - parent = pd[0]; + parent = ptrinfos[0]; while (true) { - PartitionKey key = parent->key; - PartitionDesc partdesc = parent->partdesc; + PartitionDispatch pd = parent->pd; + PartitionKey key = RelationGetPartitionKey(pd->reldesc); + PartitionDesc partdesc = RelationGetPartitionDesc(pd->reldesc); TupleTableSlot *myslot = parent->tupslot; TupleConversionMap *map = parent->tupmap; @@ -2055,13 +2023,13 @@ get_partition_for_tuple(PartitionDispatch *pd, *failed_slot = slot; break; } - else if (parent->indexes[cur_index] >= 0) + else if (pd->indexes[cur_index] >= 0) { - result = parent->indexes[cur_index]; + result = pd->indexes[cur_index]; break; } else - parent = pd[-parent->indexes[cur_index]]; + parent = ptrinfos[-pd->indexes[cur_index]]; } error_exit: diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfa3f059c2..b0c596345b 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -165,8 +165,8 @@ typedef struct CopyStateData bool volatile_defexprs; /* is any of defexprs volatile? */ List *range_table; - PartitionDispatch *partition_dispatch_info; - int num_dispatch; /* Number of entries in the above array */ + PartitionTupleRoutingInfo **ptrinfos; + int num_parted; /* Number of entries in the above array */ int num_partitions; /* Number of members in the following arrays */ ResultRelInfo *partitions; /* Per partition result relation */ TupleConversionMap **partition_tupconv_maps; @@ -2445,7 +2445,7 @@ CopyFrom(CopyState cstate) */ if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { - PartitionDispatch *partition_dispatch_info; + PartitionTupleRoutingInfo **ptrinfos; ResultRelInfo *partitions; TupleConversionMap **partition_tupconv_maps; TupleTableSlot *partition_tuple_slot; @@ -2455,13 +2455,13 @@ CopyFrom(CopyState cstate) ExecSetupPartitionTupleRouting(cstate->rel, 1, estate, - &partition_dispatch_info, + &ptrinfos, &partitions, &partition_tupconv_maps, &partition_tuple_slot, &num_parted, &num_partitions); - cstate->partition_dispatch_info = partition_dispatch_info; - cstate->num_dispatch = num_parted; + cstate->ptrinfos = ptrinfos; + cstate->num_parted = num_parted; cstate->partitions = partitions; cstate->num_partitions = num_partitions; cstate->partition_tupconv_maps = partition_tupconv_maps; @@ -2502,7 +2502,7 @@ CopyFrom(CopyState cstate) if ((resultRelInfo->ri_TrigDesc != NULL && (resultRelInfo->ri_TrigDesc->trig_insert_before_row || resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) || - cstate->partition_dispatch_info != NULL || + cstate->ptrinfos != NULL || cstate->volatile_defexprs) { useHeapMultiInsert = false; @@ -2580,7 +2580,7 @@ CopyFrom(CopyState cstate) ExecStoreTuple(tuple, slot, InvalidBuffer, false); /* Determine the partition to heap_insert the tuple into */ - if (cstate->partition_dispatch_info) + if (cstate->ptrinfos) { int leaf_part_index; TupleConversionMap *map; @@ -2594,7 +2594,7 @@ CopyFrom(CopyState cstate) * partition, respectively. */ leaf_part_index = ExecFindPartition(resultRelInfo, - cstate->partition_dispatch_info, + cstate->ptrinfos, slot, estate); Assert(leaf_part_index >= 0 && @@ -2826,24 +2826,23 @@ CopyFrom(CopyState cstate) ExecCloseIndices(resultRelInfo); - /* Close all the partitioned tables, leaf partitions, and their indices */ - if (cstate->partition_dispatch_info) + /* Release some resources that we acquired for tuple-routing. */ + if (cstate->ptrinfos) { int i; /* - * Remember cstate->partition_dispatch_info[0] corresponds to the root - * partitioned table, which we must not try to close, because it is - * the main target table of COPY that will be closed eventually by - * DoCopy(). Also, tupslot is NULL for the root partitioned table. + * cstate->ptrinfos[0] corresponds to the root partitioned table, for + * which we didn't create tupslot. */ - for (i = 1; i < cstate->num_dispatch; i++) + for (i = 1; i < cstate->num_parted; i++) { - PartitionDispatch pd = cstate->partition_dispatch_info[i]; + PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i]; - heap_close(pd->reldesc, NoLock); - ExecDropSingleTupleTableSlot(pd->tupslot); + ExecDropSingleTupleTableSlot(ptrinfo->tupslot); } + + /* Close all the leaf partitions and their indices */ for (i = 0; i < cstate->num_partitions; i++) { ResultRelInfo *resultRelInfo = cstate->partitions + i; diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 2946a0edee..493ade0e78 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -3236,8 +3236,8 @@ EvalPlanQualEnd(EPQState *epqstate) * tuple routing for partitioned tables * * Output arguments: - * 'pd' receives an array of PartitionDispatch objects with one entry for - * every partitioned table in the partition tree + * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one + * entry for each partitioned table in the partition tree * 'partitions' receives an array of ResultRelInfo objects with one entry for * every leaf partition in the partition tree * 'tup_conv_maps' receives an array of TupleConversionMap objects with one @@ -3260,7 +3260,7 @@ void ExecSetupPartitionTupleRouting(Relation rel, Index resultRTindex, EState *estate, - PartitionDispatch **pd, + PartitionTupleRoutingInfo ***ptrinfos, ResultRelInfo **partitions, TupleConversionMap ***tup_conv_maps, TupleTableSlot **partition_tuple_slot, @@ -3268,16 +3268,116 @@ ExecSetupPartitionTupleRouting(Relation rel, { TupleDesc tupDesc = RelationGetDescr(rel); List *leaf_parts; + List *pdlist; ListCell *cell; int i; ResultRelInfo *leaf_part_rri; + Relation parent; /* * Get the information about the partition tree after locking all the * partitions. */ (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL); - *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts); + pdlist = RelationGetPartitionDispatchInfo(rel, &leaf_parts); + + /* + * The pdlist list contains PartitionDispatch objects for all the + * partitioned tables in the partition tree. Using the information + * therein, we construct an array of PartitionTupleRoutingInfo objects + * to be used during tuple-routing. + */ + *num_parted = list_length(pdlist); + *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted * + sizeof(PartitionTupleRoutingInfo *)); + /* + * Free the ptinfos List structure itself as we go through (open-coded + * list_free). + */ + i = 0; + cell = list_head(pdlist); + parent = NULL; + while (cell) + { + ListCell *tmp = cell; + PartitionDispatch pd = lfirst(tmp), + next_pd = NULL; + Relation partrel; + PartitionTupleRoutingInfo *ptrinfo; + + if (lnext(tmp)) + next_pd = lfirst(lnext(tmp)); + + partrel = pd->reldesc; + + ptrinfo = (PartitionTupleRoutingInfo *) + palloc0(sizeof(PartitionTupleRoutingInfo)); + /* Stash a reference to this PartitionDispatch. */ + ptrinfo->pd = pd; + + /* State for extracting partition key from tuples will go here. */ + ptrinfo->keystate = NIL; + + /* + * For every partitioned table other than root, we must store a tuple + * table slot initialized with its tuple descriptor and a tuple + * conversion map to convert a tuple from its parent's rowtype to its + * own. That is to make sure that we are looking at the correct row + * using the correct tuple descriptor when computing its partition key + * for tuple routing. + */ + if (pd->parentoid != InvalidOid) + { + TupleDesc tupdesc = RelationGetDescr(partrel); + + /* Open the parent relation descriptor if not already done. */ + if (pd->parentoid == RelationGetRelid(rel)) + { + parent = rel; + } + else if (parent == NULL) + { + /* Locked by RelationGetPartitionDispatchInfo(). */ + parent = heap_open(pd->parentoid, NoLock); + } + + ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc); + ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent), + tupdesc, + gettext_noop("could not convert row type")); + + /* + * Close the parent descriptor, if the next partitioned table in + * the list is not a sibling, because it will have a different + * parent if so. + */ + if (parent != NULL && parent != rel && + next_pd != NULL && + next_pd->parentoid != pd->parentoid) + { + heap_close(parent, NoLock); + parent = NULL; + } + } + else + { + /* Not required for the root partitioned table */ + ptrinfo->tupslot = NULL; + ptrinfo->tupmap = NULL; + } + + (*ptrinfos)[i++] = ptrinfo; + + /* Free the ListCell. */ + cell = lnext(cell); + pfree(tmp); + } + + /* Free the List itself. */ + if (pdlist) + pfree(pdlist); + + /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */ *num_partitions = list_length(leaf_parts); *partitions = (ResultRelInfo *) palloc(*num_partitions * sizeof(ResultRelInfo)); @@ -3304,7 +3404,7 @@ ExecSetupPartitionTupleRouting(Relation rel, * Note that each of the relations in *partitions are eventually * closed by the caller. */ - partrel = heap_open(lfirst_oid(cell), NoLock); + partrel = heap_open(lfirst_oid(cell), RowExclusiveLock); part_tupdesc = RelationGetDescr(partrel); /* @@ -3317,7 +3417,7 @@ ExecSetupPartitionTupleRouting(Relation rel, * partition from the parent's type to the partition's. */ (*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc, - gettext_noop("could not convert row type")); + gettext_noop("could not convert row type")); InitResultRelInfo(leaf_part_rri, partrel, @@ -3354,11 +3454,13 @@ ExecSetupPartitionTupleRouting(Relation rel, * by get_partition_for_tuple() unchanged. */ int -ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd, - TupleTableSlot *slot, EState *estate) +ExecFindPartition(ResultRelInfo *resultRelInfo, + PartitionTupleRoutingInfo **ptrinfos, + TupleTableSlot *slot, + EState *estate) { int result; - PartitionDispatchData *failed_at; + PartitionTupleRoutingInfo *failed_at; TupleTableSlot *failed_slot; /* @@ -3368,7 +3470,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd, if (resultRelInfo->ri_PartitionCheck) ExecPartitionCheck(resultRelInfo, slot, estate); - result = get_partition_for_tuple(pd, slot, estate, + result = get_partition_for_tuple(ptrinfos, slot, estate, &failed_at, &failed_slot); if (result < 0) { @@ -3378,7 +3480,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd, char *val_desc; ExprContext *ecxt = GetPerTupleExprContext(estate); - failed_rel = failed_at->reldesc; + failed_rel = failed_at->pd->reldesc; ecxt->ecxt_scantuple = failed_slot; FormPartitionKeyDatum(failed_at, failed_slot, estate, key_values, key_isnull); diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index e12721a9b6..753ee13985 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -278,7 +278,7 @@ ExecInsert(ModifyTableState *mtstate, resultRelInfo = estate->es_result_relation_info; /* Determine the partition to heap_insert the tuple into */ - if (mtstate->mt_partition_dispatch_info) + if (mtstate->mt_ptrinfos) { int leaf_part_index; TupleConversionMap *map; @@ -292,7 +292,7 @@ ExecInsert(ModifyTableState *mtstate, * respectively. */ leaf_part_index = ExecFindPartition(resultRelInfo, - mtstate->mt_partition_dispatch_info, + mtstate->mt_ptrinfos, slot, estate); Assert(leaf_part_index >= 0 && @@ -1487,7 +1487,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate) int numResultRelInfos; /* Find the set of partitions so that we can find their TupleDescs. */ - if (mtstate->mt_partition_dispatch_info != NULL) + if (mtstate->mt_ptrinfos != NULL) { /* * For INSERT via partitioned table, so we need TupleDescs based @@ -1911,7 +1911,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) if (operation == CMD_INSERT && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { - PartitionDispatch *partition_dispatch_info; + PartitionTupleRoutingInfo **ptrinfos; ResultRelInfo *partitions; TupleConversionMap **partition_tupconv_maps; TupleTableSlot *partition_tuple_slot; @@ -1921,13 +1921,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) ExecSetupPartitionTupleRouting(rel, node->nominalRelation, estate, - &partition_dispatch_info, + &ptrinfos, &partitions, &partition_tupconv_maps, &partition_tuple_slot, &num_parted, &num_partitions); - mtstate->mt_partition_dispatch_info = partition_dispatch_info; - mtstate->mt_num_dispatch = num_parted; + mtstate->mt_ptrinfos = ptrinfos; + mtstate->mt_num_parted = num_parted; mtstate->mt_partitions = partitions; mtstate->mt_num_partitions = num_partitions; mtstate->mt_partition_tupconv_maps = partition_tupconv_maps; @@ -2336,21 +2336,24 @@ ExecEndModifyTable(ModifyTableState *node) resultRelInfo); } + /* Release some resources that we acquired for tuple-routing. */ + /* - * Close all the partitioned tables, leaf partitions, and their indices - * - * Remember node->mt_partition_dispatch_info[0] corresponds to the root - * partitioned table, which we must not try to close, because it is the - * main target table of the query that will be closed by ExecEndPlan(). - * Also, tupslot is NULL for the root partitioned table. + * node->mt_ptrinfos[0] corresponds to the root partitioned table, for + * which we didn't create tupslot. Also, its relation descriptor will + * be closed in ExecEndPlan(). */ - for (i = 1; i < node->mt_num_dispatch; i++) + for (i = 1; i < node->mt_num_parted; i++) { - PartitionDispatch pd = node->mt_partition_dispatch_info[i]; + PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i]; - heap_close(pd->reldesc, NoLock); - ExecDropSingleTupleTableSlot(pd->tupslot); + heap_close(ptrinfo->pd->reldesc, NoLock); + ExecDropSingleTupleTableSlot(ptrinfo->tupslot); } + + /* + * Close all the leaf partitions and their indices. + */ for (i = 0; i < node->mt_num_partitions; i++) { ResultRelInfo *resultRelInfo = node->mt_partitions + i; diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h index 2283c675e9..73cfb4e937 100644 --- a/src/include/catalog/partition.h +++ b/src/include/catalog/partition.h @@ -40,31 +40,20 @@ typedef struct PartitionDescData typedef struct PartitionDescData *PartitionDesc; /*----------------------- - * PartitionDispatch - information about one partitioned table in a partition - * hierarchy required to route a tuple to one of its partitions + * PartitionDispatchData - information of partitions of one partitioned table + * in a partition tree * * reldesc Relation descriptor of the table - * key Partition key information of the table - * keystate Execution state required for expressions in the partition key - * partdesc Partition descriptor of the table - * tupslot A standalone TupleTableSlot initialized with this table's tuple - * descriptor - * tupmap TupleConversionMap to convert from the parent's rowtype to - * this table's rowtype (when extracting the partition key of a - * tuple just before routing it through this table) - * indexes Array with partdesc->nparts members (for details on what - * individual members represent, see how they are set in + * parentoid OID of the parent table (InvalidOid if root partitioned table) + * indexes Array with reldesc->rd_partdesc->nparts members (for details on + * what the individual value represents, see the comments in * RelationGetPartitionDispatchInfo()) *----------------------- */ typedef struct PartitionDispatchData { Relation reldesc; - PartitionKey key; - List *keystate; /* list of ExprState */ - PartitionDesc partdesc; - TupleTableSlot *tupslot; - TupleConversionMap *tupmap; + Oid parentoid; int *indexes; } PartitionDispatchData; @@ -86,17 +75,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno, extern List *RelationGetPartitionQual(Relation rel); extern Expr *get_partition_qual_relid(Oid relid); +extern List *RelationGetPartitionDispatchInfo(Relation rel, + List **leaf_part_oids); + /* For tuple routing */ -extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel, - int *num_parted, List **leaf_part_oids); -extern void FormPartitionKeyDatum(PartitionDispatch pd, +extern void FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo, TupleTableSlot *slot, EState *estate, Datum *values, bool *isnull); -extern int get_partition_for_tuple(PartitionDispatch *pd, +extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos, TupleTableSlot *slot, EState *estate, - PartitionDispatchData **failed_at, + PartitionTupleRoutingInfo **failed_at, TupleTableSlot **failed_slot); #endif /* PARTITION_H */ diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index eacbea3c36..44b7cd0fd6 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -209,13 +209,13 @@ extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti); extern void ExecSetupPartitionTupleRouting(Relation rel, Index resultRTindex, EState *estate, - PartitionDispatch **pd, + PartitionTupleRoutingInfo ***ptrinfos, ResultRelInfo **partitions, TupleConversionMap ***tup_conv_maps, TupleTableSlot **partition_tuple_slot, int *num_parted, int *num_partitions); extern int ExecFindPartition(ResultRelInfo *resultRelInfo, - PartitionDispatch *pd, + PartitionTupleRoutingInfo **ptrinfos, TupleTableSlot *slot, EState *estate); diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 3272c4b315..ab28169a96 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -414,6 +414,42 @@ typedef struct ResultRelInfo Relation ri_PartitionRoot; } ResultRelInfo; +/* Forward declarations, to avoid including other headers */ +typedef struct PartitionDispatchData *PartitionDispatch; + +/* + * PartitionTupleRoutingInfo - information required for tuple-routing + * through one partitioned table in a partition + * tree + */ +typedef struct PartitionTupleRoutingInfo +{ + + /* Information about the table's partitions */ + PartitionDispatch pd; + + /* + * The execution state required for expressions contained in the partition + * key. It is NIL until initialized by FormPartitionKeyDatum() if and when + * it is called; for example, the first time a tuple is routed through this + * table. + */ + List *keystate; + + /* + * A standalone TupleTableSlot initialized with this table's tuple + * descriptor + */ + TupleTableSlot *tupslot; + + /* + * TupleConversionMap to convert from the parent's rowtype to this table's + * rowtype (when extracting the partition key of a tuple just before + * routing it through this table) + */ + TupleConversionMap *tupmap; +} PartitionTupleRoutingInfo; + /* ---------------- * EState information * @@ -973,9 +1009,9 @@ typedef struct ModifyTableState TupleTableSlot *mt_existing; /* slot to store existing target tuple in */ List *mt_excludedtlist; /* the excluded pseudo relation's tlist */ TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */ - struct PartitionDispatchData **mt_partition_dispatch_info; /* Tuple-routing support info */ - int mt_num_dispatch; /* Number of entries in the above array */ + struct PartitionTupleRoutingInfo **mt_ptrinfos; + int mt_num_parted; /* Number of entries in the above array */ int mt_num_partitions; /* Number of members in the following * arrays */ ResultRelInfo *mt_partitions; /* Per partition result relation */ -- 2.11.0
From b23ce2d7acbb89636c61b5e10c93211b052ef61a Mon Sep 17 00:00:00 2001 From: amit <amitlangot...@gmail.com> Date: Wed, 9 Aug 2017 15:52:36 +0900 Subject: [PATCH 2/2] Teach expand_inherited_rtentry to use partition bound order After locking the child tables using find_all_inheritors, we discard the list of child table OIDs that it generates and rebuild the same using the information returned by RelationGetPartitionDispatchInfo. --- src/backend/optimizer/prep/prepunion.c | 35 ++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c index e73c819901..1ae1a851d4 100644 --- a/src/backend/optimizer/prep/prepunion.c +++ b/src/backend/optimizer/prep/prepunion.c @@ -33,6 +33,7 @@ #include "access/heapam.h" #include "access/htup_details.h" #include "access/sysattr.h" +#include "catalog/partition.h" #include "catalog/pg_inherits_fn.h" #include "catalog/pg_type.h" #include "miscadmin.h" @@ -1452,6 +1453,40 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) */ oldrelation = heap_open(parentOID, NoLock); + /* + * For partitioned tables, we arrange the child table OIDs such that they + * appear in the partition bound order. + */ + if (rte->relkind == RELKIND_PARTITIONED_TABLE) + { + List *leaf_part_oids, + *pdlist; + + /* Discard the original list. */ + list_free(inhOIDs); + inhOIDs = NIL; + + /* Request partitioning information. */ + pdlist = RelationGetPartitionDispatchInfo(oldrelation, + &leaf_part_oids); + + /* + * First collect the partitioned child table OIDs, which includes the + * root parent at the head. + */ + foreach(l, pdlist) + { + PartitionDispatch pd = lfirst(l); + + inhOIDs = lappend_oid(inhOIDs, RelationGetRelid(pd->reldesc)); + if (pd->reldesc != oldrelation) + heap_close(pd->reldesc, NoLock); + } + + /* Concatenate the leaf partition OIDs. */ + inhOIDs = list_concat(inhOIDs, leaf_part_oids); + } + /* Scan the inheritance set and expand it */ appinfos = NIL; has_child = false; -- 2.11.0
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers