On 2017/08/05 2:25, Robert Haas wrote:
> Concretely, my proposal is:
> 
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
> 
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
> 
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.
> 
> 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...).  Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
> 
> One objection to this line of attack is that there might be a good
> case for locking only the partitioned inheritors first and then going
> back and locking the leaf nodes in a second pass, or even only when
> required for a particular row.  However, that doesn't require putting
> everything in bound order - it only requires moving the partitioned
> children to the beginning of the list.  And I think rather than having
> new logic for that, we should teach find_inheritance_children() to do
> that directly.  I have a feeling Ashutosh is going to cringe at this
> suggestion, but my idea is to do this by denormalizing: add a column
> to pg_inherits indicating whether the child is of
> RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
> pg_inherits, it can pull that flag out for free along with the
> relation OID, and qsort() first by the flag and then by the OID.  It
> can also return the number of initial elements of its return value
> which have that flag set.
> 
> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
> 
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
> 
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy.  You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.

I tried implementing this in the attached set of patches.

[PATCH 2/5] Teach pg_inherits.c a bit about partitioning

Both find_inheritance_children and find_all_inheritors now list
partitioned child tables before non-partitioned ones and return
the number of partitioned tables in an optional output argument.

[PATCH 3/5] Relieve RelationGetPartitionDispatchInfo() of doing locking

Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.

TODO: Add RelationLockHeldByMe() and put an if (!RelationLockHeldByMe())
elog(ERROR, ...) check in RelationGetPartitionDispatchInfo()

[PATCH 4/5] Teach expand_inherited_rtentry to use partition bound order

After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.

[PATCH 5/5] Store in pg_inherits if a child is a partitioned table

Catalog changes so that the is_partitioned property of child tables is
now stored in pg_inherits.  This avoids consulting the syscache to get
that property, as is currently done in patch 2/5.


I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because that will require more thought about
the required planner changes.  But the above set of patches gets us
far enough to make leaf partition sub-plans appear in partition bound
order (the same order that partition tuple-routing uses in the executor).

With the above patches, we get the desired order of child sub-plans in
Append and ModifyTable plans for partitioned tables:

create table p (a int) partition by range (a);
create table p4 partition of p for values from (30) to (40);
create table p3 partition of p for values from (20) to (30);
create table p2 partition of p for values from (10) to (20);
create table p1 partition of p for values from (1) to (10);
create table p0 partition of p for values from (minvalue) to (1) partition
by list (a);
create table p00 partition of p0 for values in (0);
create table p01 partition of p0 for values in (-1);
create table p02 partition of p0 for values in (-2);


explain select count(*) from p;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=293.12..293.13 rows=1 width=8)
   ->  Append  (cost=0.00..248.50 rows=17850 width=0)
         ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=0)

explain update p set a = a;
                          QUERY PLAN
--------------------------------------------------------------
 Update on p  (cost=0.00..248.50 rows=17850 width=10)
   Update on p1
   Update on p2
   Update on p3
   Update on p4
   Update on p02
   Update on p01
   Update on p00
   ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=10)
(15 rows)

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

I moved this patch to the front of the series, so it is now 0001.

Thanks,
Amit
From f511186bfc3be54ce77b27541695c4c609a877a6 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/5] Decouple RelationGetPartitionDispatchInfo() from executor

Currently, RelationGetPartitionDispatchInfo() and the PartitionDispatch
objects it generates are too tightly coupled with the executor's
tuple-routing code.  In particular, it is undesirable that the caller is
made responsible for releasing some resources, such as relcache
references and tuple table slots.  That makes the function harder to use
in places other than where it is currently used.

After this refactoring, ExecSetupPartitionTupleRouting() needs to do
some of the work that was previously done in
RelationGetPartitionDispatchInfo(), and get_all_partition_oids() no
longer needs to do some things that it used to.
---
 src/backend/catalog/partition.c        | 324 +++++++++++++++++----------------
 src/backend/commands/copy.c            |  35 ++--
 src/backend/executor/execMain.c        | 158 ++++++++++++++--
 src/backend/executor/nodeModifyTable.c |  29 ++-
 src/include/catalog/partition.h        |  53 ++----
 src/include/executor/executor.h        |   4 +-
 src/include/nodes/execnodes.h          |  53 +++++-
 7 files changed, 409 insertions(+), 247 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index dcc7f8af27..3d72d08c35 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
        bool            lower;                  /* this is the lower (vs upper) 
bound */
 } PartitionRangeBound;
 
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ *                                                in a partition tree
+ *
+ *     partkey         Partition key of the table
+ *     partdesc        Partition descriptor of the table
+ *     indexes         Array with partdesc->nparts members (for details on 
what the
+ *                             individual value represents, see the comments in
+ *                             RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+       PartitionKey    partkey;        /* Points into the table's relcache 
entry */
+       PartitionDesc   partdesc;       /* Ditto */
+       int                        *indexes;
+} PartitionDispatchData;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
                                                           void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -976,178 +994,167 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-       do\
-       {\
-               int             i;\
-               for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-               {\
-                       (partoids) = lappend_oid((partoids), 
(rel)->rd_partdesc->oids[i]);\
-                       (parents) = lappend((parents), (rel));\
-               }\
-       } while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
- *             Returns information necessary to route tuples down a partition 
tree
+ *             Returns necessary information for each partition in the 
partition
+ *             tree rooted at rel
  *
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
+ *
+ * Note that we lock only those partitions that are partitioned tables, because
+ * we need to look at their relcache entries to get their PartitionKey and
+ * PartitionDesc. It's the caller's responsibility to lock the leaf partitions
+ * that will actually be accessed during a given query.
  */
-PartitionDispatch *
+void
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
-                                                                int 
*num_parted, List **leaf_part_oids)
+                                                                List 
**ptinfos, List **leaf_part_oids)
 {
-       PartitionDispatchData **pd;
-       List       *all_parts = NIL,
-                          *all_parents = NIL,
-                          *parted_rels,
-                          *parted_rel_parents;
+       List       *all_parts,
+                          *all_parents;
        ListCell   *lc1,
                           *lc2;
        int                     i,
-                               k,
                                offset;
 
        /*
-        * Lock partitions and make a list of the partitioned ones to prepare
-        * their PartitionDispatch objects below.
+        * We rely on the relcache to traverse the partition tree, building
+        * both the leaf partition OIDs list and the PartitionedTableInfo list.
+        * Starting with the root partitioned table for which we already have 
the
+        * relcache entry, we look at its partition descriptor to get the
+        * partition OIDs.  For partitions that are themselves partitioned 
tables,
+        * we get their relcache entries after locking them with lockmode and
+        * queue their partitions to be looked at later.  Leaf partitions are
+        * added to the result list without locking.  For each partitioned 
table,
+        * we build a PartitionedTableInfo object and add it to the other result
+        * list.
         *
-        * Cannot use find_all_inheritors() here, because then the order of OIDs
-        * in parted_rels list would be unknown, which does not help, because we
-        * assign indexes within individual PartitionDispatch in an order that 
is
-        * predetermined (determined by the order of OIDs in individual 
partition
-        * descriptors).
+        * Since RelationBuildPartitionDescriptor() puts partitions in a canonical
+        * order determined by comparing partition bounds, we can rely on
+        * concurrent backends seeing the partitions in the same order, ensuring that
+        * there are no deadlocks when locking the partitions.
         */
-       *num_parted = 1;
-       parted_rels = list_make1(rel);
-       /* Root partitioned table has no parent, so NULL for parent */
-       parted_rel_parents = list_make1(NULL);
-       APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+       i = offset = 0;
+       *ptinfos = *leaf_part_oids = NIL;
+
+       /* Start with the root table. */
+       all_parts = list_make1_oid(RelationGetRelid(rel));
+       all_parents = list_make1_oid(InvalidOid);
        forboth(lc1, all_parts, lc2, all_parents)
        {
-               Relation        partrel = heap_open(lfirst_oid(lc1), lockmode);
-               Relation        parent = lfirst(lc2);
-               PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
+               Oid             partrelid = lfirst_oid(lc1);
+               Oid             parentrelid = lfirst_oid(lc2);
 
-               /*
-                * If this partition is a partitioned table, add its children 
to the
-                * end of the list, so that they are processed as well.
-                */
-               if (partdesc)
+               if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
                {
-                       (*num_parted)++;
-                       parted_rels = lappend(parted_rels, partrel);
-                       parted_rel_parents = lappend(parted_rel_parents, 
parent);
-                       APPEND_REL_PARTITION_OIDS(partrel, all_parts, 
all_parents);
-               }
-               else
-                       heap_close(partrel, NoLock);
+                       int             j,
+                                       k;
+                       Relation                partrel;
+                       PartitionKey    partkey;
+                       PartitionDesc   partdesc;
+                       PartitionedTableInfo   *ptinfo;
+                       PartitionDispatch               pd;
+
+                       if (partrelid != RelationGetRelid(rel))
+                               partrel = heap_open(partrelid, lockmode);
+                       else
+                               partrel = rel;
 
-               /*
-                * We keep the partitioned ones open until we're done using the
-                * information being collected here (for example, see
-                * ExecEndModifyTable).
-                */
-       }
+                       partkey = RelationGetPartitionKey(partrel);
+                       partdesc = RelationGetPartitionDesc(partrel);
+
+                       ptinfo = (PartitionedTableInfo *)
+                                                                       
palloc0(sizeof(PartitionedTableInfo));
+                       ptinfo->relid = partrelid;
+                       ptinfo->parentid = parentrelid;
+
+                       ptinfo->pd = pd = (PartitionDispatchData *)
+                                                                       
palloc0(sizeof(PartitionDispatchData));
+                       pd->partkey = partkey;
 
-       /*
-        * We want to create two arrays - one for leaf partitions and another 
for
-        * partitioned tables (including the root table and internal 
partitions).
-        * While we only create the latter here, leaf partition array of 
suitable
-        * objects (such as, ResultRelInfo) is created by the caller using the
-        * list of OIDs we return.  Indexes into these arrays get assigned in a
-        * breadth-first manner, whereby partitions of any given level are 
placed
-        * consecutively in the respective arrays.
-        */
-       pd = (PartitionDispatchData **) palloc(*num_parted *
-                                                                               
   sizeof(PartitionDispatchData *));
-       *leaf_part_oids = NIL;
-       i = k = offset = 0;
-       forboth(lc1, parted_rels, lc2, parted_rel_parents)
-       {
-               Relation        partrel = lfirst(lc1);
-               Relation        parent = lfirst(lc2);
-               PartitionKey partkey = RelationGetPartitionKey(partrel);
-               TupleDesc       tupdesc = RelationGetDescr(partrel);
-               PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
-               int                     j,
-                                       m;
-
-               pd[i] = (PartitionDispatch) 
palloc(sizeof(PartitionDispatchData));
-               pd[i]->reldesc = partrel;
-               pd[i]->key = partkey;
-               pd[i]->keystate = NIL;
-               pd[i]->partdesc = partdesc;
-               if (parent != NULL)
-               {
                        /*
-                        * For every partitioned table other than root, we must 
store a
-                        * tuple table slot initialized with its tuple 
descriptor and a
-                        * tuple conversion map to convert a tuple from its 
parent's
-                        * rowtype to its own. That is to make sure that we are 
looking at
-                        * the correct row using the correct tuple descriptor 
when
-                        * computing its partition key for tuple routing.
+                        * XXX- do we need a pinning mechanism for partition descriptors
+                        * so that their references can be managed independently of
+                        * the parent relcache entry? Like PinPartitionDesc(partdesc)?
                         */
-                       pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-                       pd[i]->tupmap = 
convert_tuples_by_name(RelationGetDescr(parent),
-                                                                               
                   tupdesc,
-                                                                               
                   gettext_noop("could not convert row type"));
-               }
-               else
-               {
-                       /* Not required for the root partitioned table */
-                       pd[i]->tupslot = NULL;
-                       pd[i]->tupmap = NULL;
-               }
-               pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+                       pd->partdesc = partdesc;
 
-               /*
-                * Indexes corresponding to the internal partitions are 
multiplied by
-                * -1 to distinguish them from those of leaf partitions.  
Encountering
-                * an index >= 0 means we found a leaf partition, which is 
immediately
-                * returned as the partition we are looking for.  A negative 
index
-                * means we found a partitioned table, whose PartitionDispatch 
object
-                * is located at the above index multiplied back by -1.  Using 
the
-                * PartitionDispatch object, search is continued further down 
the
-                * partition tree.
-                */
-               m = 0;
-               for (j = 0; j < partdesc->nparts; j++)
-               {
-                       Oid                     partrelid = partdesc->oids[j];
+                       /*
+                        * The values contained in the following array correspond to
+                        * indexes of this table's partitions in the global sequence of
+                        * all the partitions contained in the partition tree rooted at
+                        * rel, traversed in a breadth-first manner.  The values should be
+                        * such that we will be able to distinguish the leaf partitions
+                        * from the non-leaf partitions, because they are returned to
+                        * the caller in separate structures from where they will be
+                        * accessed.  The way that's done is described below:
+                        *
+                        * Leaf partition OIDs are put into the global leaf_part_oids list,
+                        * and for each one, the value stored is its ordinal position in
+                        * the list minus 1.
+                        *
+                        * PartitionedTableInfo objects corresponding to partitions that
+                        * are partitioned tables are put into the global ptinfos[] list,
+                        * and for each one, the value stored is its zero-based position
+                        * in the list multiplied by -1.  The root table occupies position
+                        * zero, so these values are always negative.
+                        *
+                        * So while looking at the values in the indexes array, if one
+                        * gets zero or a positive value, then it's a leaf partition;
+                        * otherwise, it's a partitioned table.
+                        */
+                       pd->indexes = (int *) palloc(partdesc->nparts * 
sizeof(int));
 
-                       if (get_rel_relkind(partrelid) != 
RELKIND_PARTITIONED_TABLE)
-                       {
-                               *leaf_part_oids = lappend_oid(*leaf_part_oids, 
partrelid);
-                               pd[i]->indexes[j] = k++;
-                       }
-                       else
+                       k = 0;
+                       for (j = 0; j < partdesc->nparts; j++)
                        {
+                               Oid                     partrelid = 
partdesc->oids[j];
+
                                /*
-                                * offset denotes the number of partitioned 
tables of upper
-                                * levels including those of the current level. 
 Any partition
-                                * of this table must belong to the next level 
and hence will
-                                * be placed after the last partitioned table 
of this level.
+                                * Queue this partition so that it will be 
processed later
+                                * by the outer loop.
                                 */
-                               pd[i]->indexes[j] = -(1 + offset + m);
-                               m++;
+                               all_parts = lappend_oid(all_parts, partrelid);
+                               all_parents = lappend_oid(all_parents,
+                                                                               
  RelationGetRelid(partrel));
+
+                               if (get_rel_relkind(partrelid) != 
RELKIND_PARTITIONED_TABLE)
+                               {
+                                       *leaf_part_oids = 
lappend_oid(*leaf_part_oids, partrelid);
+                                       pd->indexes[j] = i++;
+                               }
+                               else
+                               {
+                                       /*
+                                        * offset denotes the number of 
partitioned tables that
+                                        * we have already processed.  k counts 
the number of
+                                        * partitions of this table that were 
found to be
+                                        * partitioned tables.
+                                        */
+                                       pd->indexes[j] = -(1 + offset + k);
+                                       k++;
+                               }
                        }
-               }
-               i++;
 
-               /*
-                * This counts the number of partitioned tables at upper levels
-                * including those of the current level.
-                */
-               offset += m;
+                       offset += k;
+
+                       /*
+                        * Release the relation descriptor.  The lock that we have on the
+                        * table will keep the PartitionDesc that is pointing into
+                        * RelationData intact, a pointer to which we hope to keep
+                        * through this transaction's commit.
+                        * (XXX - how true is that?)
+                        */
+                       if (partrel != rel)
+                               heap_close(partrel, NoLock);
+
+                       *ptinfos = lappend(*ptinfos, ptinfo);
+               }
        }
 
-       return pd;
+       Assert(i == list_length(*leaf_part_oids));
+       Assert((offset + 1) == list_length(*ptinfos));
 }
 
 /* Module-local functions */
@@ -1864,7 +1871,7 @@ generate_partition_qual(Relation rel)
  * ----------------
  */
 void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
                                          TupleTableSlot *slot,
                                          EState *estate,
                                          Datum *values,
@@ -1873,20 +1880,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
        ListCell   *partexpr_item;
        int                     i;
 
-       if (pd->key->partexprs != NIL && pd->keystate == NIL)
+       if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
        {
                /* Check caller has set up context correctly */
                Assert(estate != NULL &&
                           GetPerTupleExprContext(estate)->ecxt_scantuple == 
slot);
 
                /* First time through, set up expression evaluation state */
-               pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+               keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+                                                                               
                estate);
        }
 
-       partexpr_item = list_head(pd->keystate);
-       for (i = 0; i < pd->key->partnatts; i++)
+       partexpr_item = list_head(keyinfo->keystate);
+       for (i = 0; i < keyinfo->key->partnatts; i++)
        {
-               AttrNumber      keycol = pd->key->partattrs[i];
+               AttrNumber      keycol = keyinfo->key->partattrs[i];
                Datum           datum;
                bool            isNull;
 
@@ -1923,13 +1931,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
  * the latter case.
  */
 int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
                                                TupleTableSlot *slot,
                                                EState *estate,
-                                               PartitionDispatchData 
**failed_at,
+                                               PartitionTupleRoutingInfo 
**failed_at,
                                                TupleTableSlot **failed_slot)
 {
-       PartitionDispatch parent;
+       PartitionTupleRoutingInfo *parent;
        Datum           values[PARTITION_MAX_KEYS];
        bool            isnull[PARTITION_MAX_KEYS];
        int                     cur_offset,
@@ -1940,11 +1948,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
        TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
        /* start with the root partitioned table */
-       parent = pd[0];
+       parent = ptrinfos[0];
        while (true)
        {
-               PartitionKey key = parent->key;
-               PartitionDesc partdesc = parent->partdesc;
+               PartitionKey  key = parent->pd->partkey;
+               PartitionDesc partdesc = parent->pd->partdesc;
                TupleTableSlot *myslot = parent->tupslot;
                TupleConversionMap *map = parent->tupmap;
 
@@ -1976,7 +1984,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
                 * So update ecxt_scantuple accordingly.
                 */
                ecxt->ecxt_scantuple = slot;
-               FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+               FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, 
isnull);
 
                if (key->strategy == PARTITION_STRATEGY_RANGE)
                {
@@ -2047,13 +2055,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
                        *failed_slot = slot;
                        break;
                }
-               else if (parent->indexes[cur_index] >= 0)
+               else if (parent->pd->indexes[cur_index] >= 0)
                {
-                       result = parent->indexes[cur_index];
+                       result = parent->pd->indexes[cur_index];
                        break;
                }
                else
-                       parent = pd[-parent->indexes[cur_index]];
+                       parent = ptrinfos[-parent->pd->indexes[cur_index]];
        }
 
 error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e296559a..b3de3de454 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
        bool            volatile_defexprs;      /* is any of defexprs volatile? 
*/
        List       *range_table;
 
-       PartitionDispatch *partition_dispatch_info;
-       int                     num_dispatch;   /* Number of entries in the 
above array */
+       PartitionTupleRoutingInfo **ptrinfos;
+       int                     num_parted;             /* Number of entries in 
the above array */
        int                     num_partitions; /* Number of members in the 
following arrays */
        ResultRelInfo *partitions;      /* Per partition result relation */
        TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
                /* Initialize state for CopyFrom tuple routing. */
                if (is_from && rel->rd_rel->relkind == 
RELKIND_PARTITIONED_TABLE)
                {
-                       PartitionDispatch *partition_dispatch_info;
+                       PartitionTupleRoutingInfo **ptrinfos;
                        ResultRelInfo *partitions;
                        TupleConversionMap **partition_tupconv_maps;
                        TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
 
                        ExecSetupPartitionTupleRouting(rel,
                                                                                
   1,
-                                                                               
   &partition_dispatch_info,
+                                                                               
   &ptrinfos,
                                                                                
   &partitions,
                                                                                
   &partition_tupconv_maps,
                                                                                
   &partition_tuple_slot,
                                                                                
   &num_parted, &num_partitions);
-                       cstate->partition_dispatch_info = 
partition_dispatch_info;
-                       cstate->num_dispatch = num_parted;
+                       cstate->ptrinfos = ptrinfos;
+                       cstate->num_parted = num_parted;
                        cstate->partitions = partitions;
                        cstate->num_partitions = num_partitions;
                        cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
        if ((resultRelInfo->ri_TrigDesc != NULL &&
                 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
                  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-               cstate->partition_dispatch_info != NULL ||
+               cstate->ptrinfos != NULL ||
                cstate->volatile_defexprs)
        {
                useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
                ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
                /* Determine the partition to heap_insert the tuple into */
-               if (cstate->partition_dispatch_info)
+               if (cstate->ptrinfos)
                {
                        int                     leaf_part_index;
                        TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
                         * partition, respectively.
                         */
                        leaf_part_index = ExecFindPartition(resultRelInfo,
-                                                                               
                cstate->partition_dispatch_info,
+                                                                               
                cstate->ptrinfos,
                                                                                
                slot,
                                                                                
                estate);
                        Assert(leaf_part_index >= 0 &&
@@ -2818,23 +2818,20 @@ CopyFrom(CopyState cstate)
 
        ExecCloseIndices(resultRelInfo);
 
-       /* Close all the partitioned tables, leaf partitions, and their indices 
*/
-       if (cstate->partition_dispatch_info)
+       /* Close all the leaf partitions and their indices */
+       if (cstate->ptrinfos)
        {
                int                     i;
 
                /*
-                * Remember cstate->partition_dispatch_info[0] corresponds to 
the root
-                * partitioned table, which we must not try to close, because 
it is
-                * the main target table of COPY that will be closed eventually 
by
-                * DoCopy().  Also, tupslot is NULL for the root partitioned 
table.
+                * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+                * which we didn't create tupslot.
                 */
-               for (i = 1; i < cstate->num_dispatch; i++)
+               for (i = 1; i < cstate->num_parted; i++)
                {
-                       PartitionDispatch pd = 
cstate->partition_dispatch_info[i];
+                       PartitionTupleRoutingInfo *ptrinfo = 
cstate->ptrinfos[i];
 
-                       heap_close(pd->reldesc, NoLock);
-                       ExecDropSingleTupleTableSlot(pd->tupslot);
+                       ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
                }
                for (i = 0; i < cstate->num_partitions; i++)
                {
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c11aa4fe21..0379e489d9 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3214,8 +3214,8 @@ EvalPlanQualEnd(EPQState *epqstate)
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *             every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ *             entry for each partitioned table in the partition tree
  * 'partitions' receives an array of ResultRelInfo objects with one entry for
  *             every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3237,7 +3237,7 @@ EvalPlanQualEnd(EPQState *epqstate)
 void
 ExecSetupPartitionTupleRouting(Relation rel,
                                                           Index resultRTindex,
-                                                          PartitionDispatch 
**pd,
+                                                          
PartitionTupleRoutingInfo ***ptrinfos,
                                                           ResultRelInfo 
**partitions,
                                                           TupleConversionMap 
***tup_conv_maps,
                                                           TupleTableSlot 
**partition_tuple_slot,
@@ -3245,13 +3245,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
 {
        TupleDesc       tupDesc = RelationGetDescr(rel);
        List       *leaf_parts;
+       List       *ptinfos = NIL;
        ListCell   *cell;
        int                     i;
        ResultRelInfo *leaf_part_rri;
+       Relation        parent;
 
-       /* Get the tuple-routing information and lock partitions */
-       *pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, 
num_parted,
-                                                                               
   &leaf_parts);
+       /*
+        * Get information about the partition tree.  All the partitioned
+        * tables in the tree are locked, but not the leaf partitions.  We
+        * lock them while building their ResultRelInfos below.
+        */
+       RelationGetPartitionDispatchInfo(rel, RowExclusiveLock,
+                                                                        
&ptinfos, &leaf_parts);
+
+       /*
+        * The ptinfos list contains PartitionedTableInfo objects for all the
+        * partitioned tables in the partition tree.  Using the information
+        * therein, we construct an array of PartitionTupleRoutingInfo objects
+        * to be used during tuple-routing.
+        */
+       *num_parted = list_length(ptinfos);
+       *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+                                                                               
sizeof(PartitionTupleRoutingInfo *));
+       /*
+        * Free the ptinfos List structure itself as we go through (open-coded
+        * list_free).
+        */
+       i = 0;
+       cell = list_head(ptinfos);
+       parent = NULL;
+       while (cell)
+       {
+               ListCell   *tmp = cell;
+               PartitionedTableInfo *ptinfo = lfirst(tmp),
+                                                        *next_ptinfo;
+               Relation                partrel;
+               PartitionTupleRoutingInfo *ptrinfo;
+
+               if (lnext(tmp))
+                       next_ptinfo = lfirst(lnext(tmp));
+
+               /* As mentioned above, the partitioned tables have been locked. */
+               if (ptinfo->relid != RelationGetRelid(rel))
+                       partrel = heap_open(ptinfo->relid, NoLock);
+               else
+                       partrel = rel;
+
+               ptrinfo = (PartitionTupleRoutingInfo *)
+                                                       
palloc0(sizeof(PartitionTupleRoutingInfo));
+               ptrinfo->relid = ptinfo->relid;
+
+               /* Stash a reference to this PartitionDispatch. */
+               ptrinfo->pd = ptinfo->pd;
+
+               /* State for extracting partition key from tuples will go here. */
+               ptrinfo->keyinfo = (PartitionKeyInfo *)
+                                                               
palloc0(sizeof(PartitionKeyInfo));
+               ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+               ptrinfo->keyinfo->keystate = NIL;
+
+               /*
+                * For every partitioned table other than root, we must store a tuple
+                * table slot initialized with its tuple descriptor and a tuple
+                * conversion map to convert a tuple from its parent's rowtype to its
+                * own.  That is to make sure that we are looking at the correct row
+                * using the correct tuple descriptor when computing its partition key
+                * for tuple routing.
+                */
+               if (ptinfo->parentid != InvalidOid)
+               {
+                       TupleDesc       tupdesc = RelationGetDescr(partrel);
+
+                       /* Open the parent relation descriptor if not already 
done. */
+                       if (ptinfo->parentid == RelationGetRelid(rel))
+                       {
+                               parent = rel;
+                       }
+                       else if (parent == NULL)
+                       {
+                               /* Locked by 
RelationGetPartitionDispatchInfo(). */
+                               parent = heap_open(ptinfo->parentid, NoLock);
+                       }
+
+                       ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+                       ptrinfo->tupmap = 
convert_tuples_by_name(RelationGetDescr(parent),
+                                                                               
                         tupdesc,
+                                                                 
gettext_noop("could not convert row type"));
+
+                       /*
+                        * Close the parent descriptor if this is the last partitioned
+                        * table in the list or if the next one is not a sibling, because
+                        * the next table will have a different parent in that case.
+                        */
+                       if (parent && parent != rel && (lnext(tmp) == NULL ||
+                               next_ptinfo->parentid != ptinfo->parentid))
+                       {
+                               heap_close(parent, NoLock);
+                               parent = NULL;
+                       }
+
+                       /*
+                        * Release the relation descriptor.  The lock that we hold on the
+                        * table will keep the PartitionDesc that is pointing into
+                        * RelationData intact, a pointer to which we hope to keep
+                        * through this transaction's commit.
+                        * (XXX - how true is that?)
+                        */
+                       if (partrel != rel)
+                               heap_close(partrel, NoLock);
+               }
+               else
+               {
+                       /* Not required for the root partitioned table */
+                       ptrinfo->tupslot = NULL;
+                       ptrinfo->tupmap = NULL;
+               }
+
+               (*ptrinfos)[i++] = ptrinfo;
+
+               /* Free the ListCell. */
+               cell = lnext(cell);
+               pfree(tmp);
+       }
+
+       /* Free the List itself. */
+       if (ptinfos)
+               pfree(ptinfos);
+
+       /* For leaf partitions, we build ResultRelInfos and 
TupleConversionMaps. */
        *num_partitions = list_length(leaf_parts);
        *partitions = (ResultRelInfo *) palloc(*num_partitions *
                                                                                
   sizeof(ResultRelInfo));
@@ -3274,11 +3396,11 @@ ExecSetupPartitionTupleRouting(Relation rel,
                TupleDesc       part_tupdesc;
 
                /*
-                * We locked all the partitions above including the leaf 
partitions.
-                * Note that each of the relations in *partitions are eventually
-                * closed by the caller.
+                * RelationGetPartitionDispatchInfo didn't lock the leaf partitions,
+                * so lock them here.  Note that each of the relations in *partitions
+                * is eventually closed (when the plan is shut down, for instance).
                 */
-               partrel = heap_open(lfirst_oid(cell), NoLock);
+               partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
                part_tupdesc = RelationGetDescr(partrel);
 
                /*
@@ -3291,7 +3413,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
                 * partition from the parent's type to the partition's.
                 */
                (*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, 
part_tupdesc,
-                                                                               
                         gettext_noop("could not convert row type"));
+                                                                
gettext_noop("could not convert row type"));
 
                InitResultRelInfo(leaf_part_rri,
                                                  partrel,
@@ -3325,11 +3447,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
  * by get_partition_for_tuple() unchanged.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
-                                 TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+                                 PartitionTupleRoutingInfo **ptrinfos,
+                                 TupleTableSlot *slot,
+                                 EState *estate)
 {
        int                     result;
-       PartitionDispatchData *failed_at;
+       PartitionTupleRoutingInfo *failed_at;
        TupleTableSlot *failed_slot;
 
        /*
@@ -3339,7 +3463,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, 
PartitionDispatch *pd,
        if (resultRelInfo->ri_PartitionCheck)
                ExecPartitionCheck(resultRelInfo, slot, estate);
 
-       result = get_partition_for_tuple(pd, slot, estate,
+       result = get_partition_for_tuple(ptrinfos, slot, estate,
                                                                         
&failed_at, &failed_slot);
        if (result < 0)
        {
@@ -3349,9 +3473,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, 
PartitionDispatch *pd,
                char       *val_desc;
                ExprContext *ecxt = GetPerTupleExprContext(estate);
 
-               failed_rel = failed_at->reldesc;
+               failed_rel = heap_open(failed_at->relid, NoLock);
                ecxt->ecxt_scantuple = failed_slot;
-               FormPartitionKeyDatum(failed_at, failed_slot, estate,
+               FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
                                                          key_values, 
key_isnull);
                val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
                                                                                
                                key_values,
diff --git a/src/backend/executor/nodeModifyTable.c 
b/src/backend/executor/nodeModifyTable.c
index 30add8e3c7..00cbee4fb6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
        resultRelInfo = estate->es_result_relation_info;
 
        /* Determine the partition to heap_insert the tuple into */
-       if (mtstate->mt_partition_dispatch_info)
+       if (mtstate->mt_ptrinfos)
        {
                int                     leaf_part_index;
                TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
                 * respectively.
                 */
                leaf_part_index = ExecFindPartition(resultRelInfo,
-                                                                               
        mtstate->mt_partition_dispatch_info,
+                                                                               
        mtstate->mt_ptrinfos,
                                                                                
        slot,
                                                                                
        estate);
                Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState 
*mtstate, EState *estate)
                int             numResultRelInfos;
 
                /* Find the set of partitions so that we can find their 
TupleDescs. */
-               if (mtstate->mt_partition_dispatch_info != NULL)
+               if (mtstate->mt_ptrinfos != NULL)
                {
                        /*
                         * For INSERT via partitioned table, so we need 
TupleDescs based
@@ -1910,7 +1910,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, 
int eflags)
        if (operation == CMD_INSERT &&
                rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        {
-               PartitionDispatch *partition_dispatch_info;
+               PartitionTupleRoutingInfo **ptrinfos;
                ResultRelInfo *partitions;
                TupleConversionMap **partition_tupconv_maps;
                TupleTableSlot *partition_tuple_slot;
@@ -1919,13 +1919,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, 
int eflags)
 
                ExecSetupPartitionTupleRouting(rel,
                                                                           
node->nominalRelation,
-                                                                          
&partition_dispatch_info,
+                                                                          
&ptrinfos,
                                                                           
&partitions,
                                                                           
&partition_tupconv_maps,
                                                                           
&partition_tuple_slot,
                                                                           
&num_parted, &num_partitions);
-               mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-               mtstate->mt_num_dispatch = num_parted;
+               mtstate->mt_ptrinfos = ptrinfos;
+               mtstate->mt_num_parted = num_parted;
                mtstate->mt_partitions = partitions;
                mtstate->mt_num_partitions = num_partitions;
                mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2335,19 +2335,16 @@ ExecEndModifyTable(ModifyTableState *node)
        }
 
        /*
-        * Close all the partitioned tables, leaf partitions, and their indices
+        * Close all the leaf partitions and their indices.
         *
-        * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-        * partitioned table, which we must not try to close, because it is the
-        * main target table of the query that will be closed by ExecEndPlan().
-        * Also, tupslot is NULL for the root partitioned table.
+        * node->mt_ptrinfos[0] corresponds to the root partitioned
+        * table, for which we didn't create tupslot.
         */
-       for (i = 1; i < node->mt_num_dispatch; i++)
+       for (i = 1; i < node->mt_num_parted; i++)
        {
-               PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+               PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
 
-               heap_close(pd->reldesc, NoLock);
-               ExecDropSingleTupleTableSlot(pd->tupslot);
+               ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
        }
        for (i = 0; i < node->mt_num_partitions; i++)
        {
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 434ded37d7..6a0c81b3bd 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
 
 typedef struct PartitionDescData *PartitionDesc;
 
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- *     reldesc         Relation descriptor of the table
- *     key                     Partition key information of the table
- *     keystate        Execution state required for expressions in the 
partition key
- *     partdesc        Partition descriptor of the table
- *     tupslot         A standalone TupleTableSlot initialized with this 
table's tuple
- *                             descriptor
- *     tupmap          TupleConversionMap to convert from the parent's rowtype 
to
- *                             this table's rowtype (when extracting the 
partition key of a
- *                             tuple just before routing it through this table)
- *     indexes         Array with partdesc->nparts members (for details on what
- *                             individual members represent, see how they are 
set in
- *                             RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
  */
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
 {
-       Relation        reldesc;
-       PartitionKey key;
-       List       *keystate;           /* list of ExprState */
-       PartitionDesc partdesc;
-       TupleTableSlot *tupslot;
-       TupleConversionMap *tupmap;
-       int                *indexes;
-} PartitionDispatchData;
+       Oid                             relid;
+       Oid                             parentid;
 
-typedef struct PartitionDispatchData *PartitionDispatch;
+       /*
+        * This contains information about bounds of the partitions of this
+        * table and about where individual partitions are placed in the global
+        * partition tree.
+        */
+       PartitionDispatch pd;
+} PartitionedTableInfo;
 
 extern void RelationBuildPartitionDesc(Relation relation);
 extern bool partition_bounds_equal(PartitionKey key,
@@ -85,18 +72,18 @@ extern List *map_partition_varattnos(List *expr, int 
target_varno,
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
+                                                                List 
**ptinfos, List **leaf_part_oids);
+
 /* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-                                                                int lockmode, 
int *num_parted,
-                                                                List 
**leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
                                          TupleTableSlot *slot,
                                          EState *estate,
                                          Datum *values,
                                          bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **pd,
                                                TupleTableSlot *slot,
                                                EState *estate,
-                                               PartitionDispatchData 
**failed_at,
+                                               PartitionTupleRoutingInfo 
**failed_at,
                                                TupleTableSlot **failed_slot);
 #endif                                                 /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9d03..6e1d3a6d2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, 
Index rti,
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
                                                           Index resultRTindex,
-                                                          PartitionDispatch 
**pd,
+                                                          
PartitionTupleRoutingInfo ***ptrinfos,
                                                           ResultRelInfo 
**partitions,
                                                           TupleConversionMap 
***tup_conv_maps,
                                                           TupleTableSlot 
**partition_tuple_slot,
                                                           int *num_parted, int 
*num_partitions);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-                                 PartitionDispatch *pd,
+                                 PartitionTupleRoutingInfo **ptrinfos,
                                  TupleTableSlot *slot,
                                  EState *estate);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 35c28a6143..1514d62f52 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
        Relation        ri_PartitionRoot;
 } ResultRelInfo;
 
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfo - execution state for the partition key of a
+ *                                               partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key.  It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+       PartitionKey    key;            /* Points into the table's relcache 
entry */
+       List               *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ *                                                        through one partitioned table in a partition
+ *                                                        tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+       /* OID of the table */
+       Oid                             relid;
+
+       /* Information about the table's partitions */
+       PartitionDispatch       pd;
+
+       /* See comment above the definition of PartitionKeyInfo */
+       PartitionKeyInfo   *keyinfo;
+
+       /*
+        * A standalone TupleTableSlot initialized with this table's tuple
+        * descriptor
+        */
+       TupleTableSlot *tupslot;
+
+       /*
+        * TupleConversionMap to convert from the parent's rowtype to this 
table's
+        * rowtype (when extracting the partition key of a tuple just before
+        * routing it through this table)
+        */
+       TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
 /* ----------------
  *       EState information
  *
@@ -970,9 +1019,9 @@ typedef struct ModifyTableState
        TupleTableSlot *mt_existing;    /* slot to store existing target tuple 
in */
        List       *mt_excludedtlist;   /* the excluded pseudo relation's tlist 
 */
        TupleTableSlot *mt_conflproj;   /* CONFLICT ... SET ... projection 
target */
-       struct PartitionDispatchData **mt_partition_dispatch_info;
        /* Tuple-routing support info */
-       int                     mt_num_dispatch;        /* Number of entries in 
the above array */
+       struct PartitionTupleRoutingInfo **mt_ptrinfos;
+       int                     mt_num_parted;          /* Number of entries in 
the above array */
        int                     mt_num_partitions;      /* Number of members in 
the following
                                                                         * 
arrays */
        ResultRelInfo *mt_partitions;   /* Per partition result relation */
-- 
2.11.0

From b7ec1ddc2e26e75e0ab092c36461c09e9ca0a9d8 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Tue, 8 Aug 2017 18:42:30 +0900
Subject: [PATCH 2/5] Teach pg_inherits.c a bit about partitioning

Both find_inheritance_children and find_all_inheritors now list
partitioned child tables before non-partitioned ones and return
the number of partitioned tables in an optional output argument.
---
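
Reviewer note (not part of the commit message): the ordering described
above can be sketched outside the backend as follows.  This is only an
illustration under stated assumptions: Oid is a stand-in typedef rather
than the real PostgreSQL type, and sort_children_partitioned_first() is
a hypothetical helper that does not exist in the patch.  InhChildInfo
and the comparator follow the definitions added to pg_inherits.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's Oid type (assumption for this sketch). */
typedef unsigned int Oid;

/* Mirrors the InhChildInfo struct added by the patch. */
typedef struct InhChildInfo
{
	Oid			relid;
	bool		is_partitioned;
} InhChildInfo;

/* qsort comparator: partitioned children first, then ascending OID. */
static int
inhchildinfo_cmp(const void *p1, const void *p2)
{
	const InhChildInfo *c1 = (const InhChildInfo *) p1;
	const InhChildInfo *c2 = (const InhChildInfo *) p2;

	if (c1->is_partitioned != c2->is_partitioned)
		return c1->is_partitioned ? -1 : 1;
	if (c1->relid < c2->relid)
		return -1;
	if (c1->relid > c2->relid)
		return 1;
	return 0;
}

/*
 * Hypothetical helper, not in the patch: sort the children in place and
 * return how many leading entries are partitioned tables.
 */
static int
sort_children_partitioned_first(InhChildInfo *children, size_t n)
{
	int			num_partitioned = 0;
	size_t		i;

	assert(children != NULL || n == 0);

	if (n > 1)
		qsort(children, n, sizeof(InhChildInfo), inhchildinfo_cmp);

	/* After sorting, all partitioned children form a prefix of the array. */
	for (i = 0; i < n && children[i].is_partitioned; i++)
		num_partitioned++;

	return num_partitioned;
}
```

Because partitioned children end up as a prefix of the sorted array, a
caller can lock just that prefix first and defer the leaf partitions,
which is the use the optional output argument is meant to enable.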
 contrib/sepgsql/dml.c                  |   2 +-
 src/backend/catalog/partition.c        |   2 +-
 src/backend/catalog/pg_inherits.c      | 157 ++++++++++++++++++++++++++-------
 src/backend/commands/analyze.c         |   3 +-
 src/backend/commands/lockcmds.c        |   2 +-
 src/backend/commands/publicationcmds.c |   2 +-
 src/backend/commands/tablecmds.c       |  39 ++++----
 src/backend/commands/vacuum.c          |   3 +-
 src/backend/optimizer/prep/prepunion.c |   2 +-
 src/include/catalog/pg_inherits_fn.h   |   5 +-
 10 files changed, 162 insertions(+), 55 deletions(-)

diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c
index b643720e36..6fc279805c 100644
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -333,7 +333,7 @@ sepgsql_dml_privileges(List *rangeTabls, bool 
abort_on_violation)
                if (!rte->inh)
                        tableIds = list_make1_oid(rte->relid);
                else
-                       tableIds = find_all_inheritors(rte->relid, NoLock, 
NULL);
+                       tableIds = find_all_inheritors(rte->relid, NoLock, 
NULL, NULL);
 
                foreach(li, tableIds)
                {
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 3d72d08c35..465e4fc097 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -196,7 +196,7 @@ RelationBuildPartitionDesc(Relation rel)
                return;
 
        /* Get partition oids from pg_inherits */
-       inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+       inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock, 
NULL);
 
        /* Collect bound spec nodes in a list */
        i = 0;
diff --git a/src/backend/catalog/pg_inherits.c 
b/src/backend/catalog/pg_inherits.c
index 245a374fc9..99b1e70de6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -30,9 +30,12 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/memutils.h"
+#include "utils/lsyscache.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
+static int32 inhchildinfo_cmp(const void *p1, const void *p2);
+
 /*
  * Entry of a hash table used in find_all_inheritors. See below.
  */
@@ -42,6 +45,30 @@ typedef struct SeenRelsEntry
        ListCell   *numparents_cell;    /* corresponding list cell */
 } SeenRelsEntry;
 
+/* Information about one inheritance child table. */
+typedef struct InhChildInfo
+{
+       Oid                     relid;
+       bool            is_partitioned;
+} InhChildInfo;
+
+#define OID_CMP(o1, o2) \
+               ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0))
+
+static int32
+inhchildinfo_cmp(const void *p1, const void *p2)
+{
+       InhChildInfo c1 = *((const InhChildInfo *) p1);
+       InhChildInfo c2 = *((const InhChildInfo *) p2);
+
+       if (c1.is_partitioned && !c2.is_partitioned)
+               return -1;
+       if (!c1.is_partitioned && c2.is_partitioned)
+               return 1;
+
+       return OID_CMP(c1.relid, c2.relid);
+}
+
 /*
  * find_inheritance_children
  *
@@ -54,7 +81,8 @@ typedef struct SeenRelsEntry
  * against possible DROPs of child relations.
  */
 List *
-find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
+find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
+                                                 int *num_partitioned_children)
 {
        List       *list = NIL;
        Relation        relation;
@@ -62,9 +90,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
        ScanKeyData key[1];
        HeapTuple       inheritsTuple;
        Oid                     inhrelid;
-       Oid                *oidarr;
-       int                     maxoids,
-                               numoids,
+       InhChildInfo *inhchildren;
+       int                     maxchildren,
+                               numchildren,
+                               my_num_partitioned_children,
                                i;
 
        /*
@@ -77,9 +106,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
        /*
         * Scan pg_inherits and build a working array of subclass OIDs.
         */
-       maxoids = 32;
-       oidarr = (Oid *) palloc(maxoids * sizeof(Oid));
-       numoids = 0;
+       maxchildren = 32;
+       inhchildren = (InhChildInfo *) palloc(maxchildren * sizeof(InhChildInfo));
+       numchildren = 0;
+       my_num_partitioned_children = 0;
 
        relation = heap_open(InheritsRelationId, AccessShareLock);
 
@@ -94,33 +124,47 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
        while ((inheritsTuple = systable_getnext(scan)) != NULL)
        {
                inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
-               if (numoids >= maxoids)
+               if (numchildren >= maxchildren)
+               {
+                       maxchildren *= 2;
+                       inhchildren = (InhChildInfo *) repalloc(inhchildren,
+                                                       maxchildren * sizeof(InhChildInfo));
+               }
+               inhchildren[numchildren].relid = inhrelid;
+
+               if (get_rel_relkind(inhrelid) == RELKIND_PARTITIONED_TABLE)
                {
-                       maxoids *= 2;
-                       oidarr = (Oid *) repalloc(oidarr, maxoids * sizeof(Oid));
+                       inhchildren[numchildren].is_partitioned = true;
+                       my_num_partitioned_children++;
                }
-               oidarr[numoids++] = inhrelid;
+               else
+                       inhchildren[numchildren].is_partitioned = false;
+               numchildren++;
        }
 
        systable_endscan(scan);
 
        heap_close(relation, AccessShareLock);
 
+       if (num_partitioned_children)
+               *num_partitioned_children = my_num_partitioned_children;
+
        /*
         * If we found more than one child, sort them by OID.  This ensures
         * reasonably consistent behavior regardless of the vagaries of an
         * indexscan.  This is important since we need to be sure all backends
         * lock children in the same order to avoid needless deadlocks.
         */
-       if (numoids > 1)
-               qsort(oidarr, numoids, sizeof(Oid), oid_cmp);
+       if (numchildren > 1)
+               qsort(inhchildren, numchildren, sizeof(InhChildInfo),
+                         inhchildinfo_cmp);
 
        /*
         * Acquire locks and build the result list.
         */
-       for (i = 0; i < numoids; i++)
+       for (i = 0; i < numchildren; i++)
        {
-               inhrelid = oidarr[i];
+               inhrelid = inhchildren[i].relid;
 
                if (lockmode != NoLock)
                {
@@ -144,7 +188,7 @@ find_inheritance_children(Oid parentrelId, LOCKMODE 
lockmode)
                list = lappend_oid(list, inhrelid);
        }
 
-       pfree(oidarr);
+       pfree(inhchildren);
 
        return list;
 }
@@ -159,18 +203,28 @@ find_inheritance_children(Oid parentrelId, LOCKMODE 
lockmode)
  *             given rel.
  *
  * The specified lock type is acquired on all child relations (but not on the
- * given rel; caller should already have locked it).  If lockmode is NoLock
- * then no locks are acquired, but caller must beware of race conditions
- * against possible DROPs of child relations.
+ * given rel; caller should already have locked it).  All child relations
+ * are locked this way, whether or not they are partitioned tables; this
+ * function has no option to lock only the partitioned children.  If lockmode
+ * is NoLock then no locks are acquired, but caller must beware of race
+ * conditions against possible DROPs of child relations.
+ *
+ * Returned list of OIDs is such that all the partitioned tables in the tree
+ * appear at the head of the list.  If num_partitioned_children is non-NULL,
+ * *num_partitioned_children returns the number of partitioned child table
+ * OIDs at the head of the list.
  */
 List *
-find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
+find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
+                                       List **numparents, int *num_partitioned_children)
 {
        /* hash table for O(1) rel_oid -> rel_numparents cell lookup */
        HTAB       *seen_rels;
        HASHCTL         ctl;
        List       *rels_list,
-                          *rel_numparents;
+                          *rel_numparents,
+                          *partitioned_rels_list,
+                          *other_rels_list;
        ListCell   *l;
 
        memset(&ctl, 0, sizeof(ctl));
@@ -185,31 +239,71 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, 
List **numparents)
 
        /*
         * We build a list starting with the given rel and adding all direct and
-        * indirect children.  We can use a single list as both the record of
-        * already-found rels and the agenda of rels yet to be scanned for more
-        * children.  This is a bit tricky but works because the foreach() macro
-        * doesn't fetch the next list element until the bottom of the loop.
+        * indirect children.  We can use a single list (rels_list) as both the
+        * record of already-found rels and the agenda of rels yet to be scanned
+        * for more children.  This is a bit tricky but works because the foreach()
+        * macro doesn't fetch the next list element until the bottom of the loop.
+        *
+        * partitioned_rels_list will contain the OIDs of the partitioned child
+        * tables and other_rels_list will contain the OIDs of the non-partitioned
+        * child tables.  Result list will be generated by concatenating the two
+        * lists together with partitioned_rels_list appearing first.
         */
        rels_list = list_make1_oid(parentrelId);
+       partitioned_rels_list = list_make1_oid(parentrelId);
+       other_rels_list = NIL;
        rel_numparents = list_make1_int(0);
 
+       if (num_partitioned_children)
+               *num_partitioned_children = 0;
+
        foreach(l, rels_list)
        {
                Oid                     currentrel = lfirst_oid(l);
                List       *currentchildren;
-               ListCell   *lc;
+               ListCell   *lc,
+                                  *first_nonpartitioned_child;
+               int                     cur_num_partitioned_children = 0,
+                                       i;
 
                /* Get the direct children of this rel */
-               currentchildren = find_inheritance_children(currentrel, 
lockmode);
+               currentchildren = find_inheritance_children(currentrel, 
lockmode,
+                                                                               
        &cur_num_partitioned_children);
+
+               if (num_partitioned_children)
+                       *num_partitioned_children += 
cur_num_partitioned_children;
+
+               /*
+                * Append partitioned children to rels_list and partitioned_rels_list.
+                * We know for sure that partitioned children don't need the
+                * de-duplication logic in the following loop, because partitioned
+                * tables are not allowed to participate in multiple inheritance.
+                */
+               i = 0;
+               foreach(lc, currentchildren)
+               {
+                       if (i < cur_num_partitioned_children)
+                       {
+                               Oid             child_oid = lfirst_oid(lc);
+
+                               rels_list = lappend_oid(rels_list, child_oid);
+                               partitioned_rels_list = 
lappend_oid(partitioned_rels_list,
+                                                                               
                        child_oid);
+                       }
+                       else
+                               break;
+                       i++;
+               }
+               first_nonpartitioned_child = lc;
 
                /*
                 * Add to the queue only those children not already seen. This 
avoids
                 * making duplicate entries in case of multiple inheritance 
paths from
                 * the same parent.  (It'll also keep us from getting into an 
infinite
                 * loop, though theoretically there can't be any cycles in the
-                * inheritance graph anyway.)
+                * inheritance graph anyway.)  Also, add them to the other_rels_list.
                 */
-               foreach(lc, currentchildren)
+               for_each_cell(lc, first_nonpartitioned_child)
                {
                        Oid                     child_oid = lfirst_oid(lc);
                        bool            found;
@@ -225,6 +319,7 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, 
List **numparents)
                        {
                                /* if it's not there, add it. expect 1 parent, 
initially. */
                                rels_list = lappend_oid(rels_list, child_oid);
+                               other_rels_list = lappend_oid(other_rels_list, 
child_oid);
                                rel_numparents = lappend_int(rel_numparents, 1);
                                hash_entry->numparents_cell = 
rel_numparents->tail;
                        }
@@ -237,8 +332,10 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, 
List **numparents)
                list_free(rel_numparents);
 
        hash_destroy(seen_rels);
+       list_free(rels_list);
 
-       return rels_list;
+       /* List partitioned child tables before non-partitioned ones. */
+       return list_concat(partitioned_rels_list, other_rels_list);
 }
 
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b638271b3..ae8ce71e1c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1282,7 +1282,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
         * the children.
         */
        tableOIDs =
-               find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, 
NULL);
+               find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, 
NULL,
+                                                       NULL);
 
        /*
         * Check that there's at least one descendant, else fail.  This could
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 9fe9e022b0..529f244f7e 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -112,7 +112,7 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
        List       *children;
        ListCell   *lc;
 
-       children = find_inheritance_children(reloid, NoLock);
+       children = find_inheritance_children(reloid, NoLock, NULL);
 
        foreach(lc, children)
        {
diff --git a/src/backend/commands/publicationcmds.c 
b/src/backend/commands/publicationcmds.c
index 610cb499d2..64179ea3ef 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -516,7 +516,7 @@ OpenTableList(List *tables)
                        List       *children;
 
                        children = find_all_inheritors(myrelid, 
ShareUpdateExclusiveLock,
-                                                                               
   NULL);
+                                                                               
   NULL, NULL);
 
                        foreach(child, children)
                        {
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1b8d4b3d17..14bac087d9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1231,7 +1231,8 @@ ExecuteTruncate(TruncateStmt *stmt)
                        ListCell   *child;
                        List       *children;
 
-                       children = find_all_inheritors(myrelid, 
AccessExclusiveLock, NULL);
+                       children = find_all_inheritors(myrelid, 
AccessExclusiveLock, NULL,
+                                                                               
   NULL);
 
                        foreach(child, children)
                        {
@@ -2556,7 +2557,7 @@ renameatt_internal(Oid myrelid,
                 * outside the inheritance hierarchy being processed.
                 */
                child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
-                                                                               
 &child_numparents);
+                                                                               
 &child_numparents, NULL);
 
                /*
                 * find_all_inheritors does the recursive search of the 
inheritance
@@ -2583,7 +2584,7 @@ renameatt_internal(Oid myrelid,
                 * expected_parents will only be 0 if we are not already 
recursing.
                 */
                if (expected_parents == 0 &&
-                       find_inheritance_children(myrelid, NoLock) != NIL)
+                       find_inheritance_children(myrelid, NoLock, NULL) != NIL)
                        ereport(ERROR,
                                        
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
                                         errmsg("inherited column \"%s\" must 
be renamed in child tables too",
@@ -2766,7 +2767,7 @@ rename_constraint_internal(Oid myrelid,
                                           *li;
 
                        child_oids = find_all_inheritors(myrelid, 
AccessExclusiveLock,
-                                                                               
         &child_numparents);
+                                                                               
         &child_numparents, NULL);
 
                        forboth(lo, child_oids, li, child_numparents)
                        {
@@ -2782,7 +2783,7 @@ rename_constraint_internal(Oid myrelid,
                else
                {
                        if (expected_parents == 0 &&
-                               find_inheritance_children(myrelid, NoLock) != 
NIL)
+                               find_inheritance_children(myrelid, NoLock, 
NULL) != NIL)
                                ereport(ERROR,
                                                
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
                                                 errmsg("inherited constraint 
\"%s\" must be renamed in child tables too",
@@ -4790,7 +4791,7 @@ ATSimpleRecursion(List **wqueue, Relation rel,
                ListCell   *child;
                List       *children;
 
-               children = find_all_inheritors(relid, lockmode, NULL);
+               children = find_all_inheritors(relid, lockmode, NULL, NULL);
 
                /*
                 * find_all_inheritors does the recursive search of the 
inheritance
@@ -5186,7 +5187,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, 
Relation rel,
         */
        if (colDef->identity &&
                recurse &&
-               find_inheritance_children(myrelid, NoLock) != NIL)
+               find_inheritance_children(myrelid, NoLock, NULL) != NIL)
                ereport(ERROR,
                                (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
                                 errmsg("cannot recursively add identity column 
to table that has child tables")));
@@ -5392,7 +5393,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, 
Relation rel,
         * routines, we have to do this one level of recursion at a time; we 
can't
         * use find_all_inheritors to do it in one pass.
         */
-       children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+                                                                               
 NULL);
 
        /*
         * If we are told not to recurse, there had better not be any child
@@ -6511,7 +6513,8 @@ ATExecDropColumn(List **wqueue, Relation rel, const char 
*colName,
         * routines, we have to do this one level of recursion at a time; we 
can't
         * use find_all_inheritors to do it in one pass.
         */
-       children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+                                                                               
 NULL);
 
        if (children)
        {
@@ -6945,7 +6948,8 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
         * routines, we have to do this one level of recursion at a time; we 
can't
         * use find_all_inheritors to do it in one pass.
         */
-       children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+                                                                               
 NULL);
 
        /*
         * Check if ONLY was specified with ALTER TABLE.  If so, allow the
@@ -7664,7 +7668,7 @@ ATExecValidateConstraint(Relation rel, char *constrName, 
bool recurse,
                         */
                        if (!recursing && !con->connoinherit)
                                children = 
find_all_inheritors(RelationGetRelid(rel),
-                                                                               
           lockmode, NULL);
+                                                                               
           lockmode, NULL, NULL);
 
                        /*
                         * For CHECK constraints, we must ensure that we only 
mark the
@@ -8547,7 +8551,8 @@ ATExecDropConstraint(Relation rel, const char *constrName,
         * use find_all_inheritors to do it in one pass.
         */
        if (!is_no_inherit_constraint)
-               children = find_inheritance_children(RelationGetRelid(rel), 
lockmode);
+               children = find_inheritance_children(RelationGetRelid(rel), 
lockmode,
+                                                                               
         NULL);
        else
                children = NIL;
 
@@ -8836,7 +8841,7 @@ ATPrepAlterColumnType(List **wqueue,
                ListCell   *child;
                List       *children;
 
-               children = find_all_inheritors(relid, lockmode, NULL);
+               children = find_all_inheritors(relid, lockmode, NULL, NULL);
 
                /*
                 * find_all_inheritors does the recursive search of the 
inheritance
@@ -8887,7 +8892,8 @@ ATPrepAlterColumnType(List **wqueue,
                }
        }
        else if (!recursing &&
-                        find_inheritance_children(RelationGetRelid(rel), 
NoLock) != NIL)
+                        find_inheritance_children(RelationGetRelid(rel),
+                                                                          
NoLock, NULL) != NIL)
                ereport(ERROR,
                                (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
                                 errmsg("type of inherited column \"%s\" must 
be changed in child tables too",
@@ -10997,7 +11003,7 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, 
LOCKMODE lockmode)
         * We use weakest lock we can on child's children, namely 
AccessShareLock.
         */
        children = find_all_inheritors(RelationGetRelid(child_rel),
-                                                                  
AccessShareLock, NULL);
+                                                                  
AccessShareLock, NULL, NULL);
 
        if (list_member_oid(children, RelationGetRelid(parent_rel)))
                ereport(ERROR,
@@ -13503,7 +13509,8 @@ ATExecAttachPartition(List **wqueue, Relation rel, 
PartitionCmd *cmd)
         * weaker lock now and the stronger one only when needed.
         */
        attachrel_children = find_all_inheritors(RelationGetRelid(attachrel),
-                                                                               
         AccessExclusiveLock, NULL);
+                                                                               
         AccessExclusiveLock, NULL,
+                                                                               
         NULL);
        if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
                ereport(ERROR,
                                (errcode(ERRCODE_DUPLICATE_TABLE),
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa181207a..e2e5ffce42 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -430,7 +430,8 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
                oldcontext = MemoryContextSwitchTo(vac_context);
                if (include_parts)
                        oid_list = list_concat(oid_list,
-                                                                  
find_all_inheritors(relid, NoLock, NULL));
+                                                                  
find_all_inheritors(relid, NoLock, NULL,
+                                                                               
                           NULL));
                else
                        oid_list = lappend_oid(oid_list, relid);
                MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/optimizer/prep/prepunion.c 
b/src/backend/optimizer/prep/prepunion.c
index cf46b74782..09e45c2982 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1418,7 +1418,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry 
*rte, Index rti)
                lockmode = AccessShareLock;
 
        /* Scan for all members of inheritance set, acquire needed locks */
-       inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+       inhOIDs = find_all_inheritors(parentOID, lockmode, NULL, NULL);
 
        /*
         * Check that there's at least one descendant, else treat as no-child
diff --git a/src/include/catalog/pg_inherits_fn.h 
b/src/include/catalog/pg_inherits_fn.h
index 7743388899..8f371acae7 100644
--- a/src/include/catalog/pg_inherits_fn.h
+++ b/src/include/catalog/pg_inherits_fn.h
@@ -17,9 +17,10 @@
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
 
-extern List *find_inheritance_children(Oid parentrelId, LOCKMODE lockmode);
+extern List *find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
+                                                 int *num_partitioned_children);
 extern List *find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
-                                       List **parents);
+                                       List **parents, int *num_partitioned_children);
 extern bool has_subclass(Oid relationId);
 extern bool has_superclass(Oid relationId);
 extern bool typeInheritsFrom(Oid subclassTypeId, Oid superclassTypeId);
-- 
2.11.0
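
For readers following the ordering that the patch above gives
find_inheritance_children(), here is a minimal standalone sketch of the
same idea: sort the children so that partitioned tables come first, then
by OID within each group.  It is not the patch's code.  The Oid typedef
is a stand-in so that the file compiles outside the PostgreSQL tree, and
sort_children() is a hypothetical helper introduced only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's Oid type (an assumption, to stay standalone). */
typedef unsigned int Oid;

/* Mirrors the shape of the InhChildInfo struct added by the patch. */
typedef struct InhChildInfo
{
    Oid     relid;
    bool    is_partitioned;
} InhChildInfo;

/* qsort comparator: partitioned children first, then ascending OID. */
static int
inhchildinfo_cmp(const void *p1, const void *p2)
{
    const InhChildInfo *c1 = (const InhChildInfo *) p1;
    const InhChildInfo *c2 = (const InhChildInfo *) p2;

    if (c1->is_partitioned != c2->is_partitioned)
        return c1->is_partitioned ? -1 : 1;
    if (c1->relid != c2->relid)
        return c1->relid < c2->relid ? -1 : 1;
    return 0;
}

/*
 * Hypothetical helper (not part of the patch): sort the array in place and
 * return how many partitioned children ended up at its head.
 */
static int
sort_children(InhChildInfo *children, int nchildren)
{
    int     npartitioned = 0;
    int     i;

    assert(nchildren >= 0);

    for (i = 0; i < nchildren; i++)
        if (children[i].is_partitioned)
            npartitioned++;

    if (nchildren > 1)
        qsort(children, (size_t) nchildren, sizeof(InhChildInfo),
              inhchildinfo_cmp);

    return npartitioned;
}
```

Given the children (300, plain), (200, partitioned), (100, plain) and
(150, partitioned), sort_children() returns 2 and leaves the array in
the order 150, 200, 100, 300.  Every backend therefore visits, and
locks, the children in the same order.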

From 6ae18ec3456b2a3fedd239059687873ae91ddbee Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 9 Aug 2017 15:34:45 +0900
Subject: [PATCH 3/5] Relieve RelationGetPartitionDispatchInfo() of doing any
 locking

Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.
---
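
The contract described above (the caller locks everything first, the
callee opens with NoLock and only checks) can be illustrated with a toy
model.  This is a sketch under stated assumptions, not PostgreSQL code:
LockSet, lock_relation(), lock_all_inheritors() and
open_already_locked() are all hypothetical names, and the boolean
result stands in for where real code would elog(ERROR) after a
RelationLockHeldByMe()-style check.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for PostgreSQL's Oid type (an assumption, to stay standalone). */
typedef unsigned int Oid;

#define MAX_HELD_LOCKS 64

/* Toy model of "locks held by this backend"; purely illustrative. */
typedef struct LockSet
{
    Oid     held[MAX_HELD_LOCKS];
    int     nheld;
} LockSet;

/* Analogue of the proposed RelationLockHeldByMe() check. */
static bool
lock_held_by_me(const LockSet *locks, Oid relid)
{
    int     i;

    for (i = 0; i < locks->nheld; i++)
        if (locks->held[i] == relid)
            return true;
    return false;
}

/* Record a lock; re-locking is a no-op.  Returns false if the set is full. */
static bool
lock_relation(LockSet *locks, Oid relid)
{
    if (lock_held_by_me(locks, relid))
        return true;
    if (locks->nheld >= MAX_HELD_LOCKS)
        return false;
    locks->held[locks->nheld++] = relid;
    return true;
}

/* Step 1 (caller): lock every relation, in the order given. */
static bool
lock_all_inheritors(LockSet *locks, const Oid *relids, int nrels)
{
    int     i;

    assert(nrels >= 0);
    for (i = 0; i < nrels; i++)
        if (!lock_relation(locks, relids[i]))
            return false;
    return true;
}

/*
 * Step 2 (callee): open without taking a lock, but verify that the caller
 * did its part.  Returns false where real code would raise an error.
 */
static bool
open_already_locked(const LockSet *locks, Oid relid)
{
    return lock_held_by_me(locks, relid);
}
```

With the three OIDs of a small tree passed to lock_all_inheritors(),
open_already_locked() succeeds for any of them and fails for an OID
that was never locked, which is the situation the XXX comment in the
diff below would like to detect.
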
 src/backend/catalog/partition.c | 42 ++++++++++++++++++++---------------------
 src/backend/executor/execMain.c | 20 +++++++++++---------
 src/include/catalog/partition.h |  4 ++--
 3 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 465e4fc097..4c16bf143b 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1003,14 +1003,12 @@ get_partition_qual_relid(Oid relid)
  * one member, that is, one for the root partitioned table), *leaf_part_oids
  * contains a list of the OIDs of of all the leaf partitions.
  *
- * Note that we lock only those partitions that are partitioned tables, because
- * we need to look at its relcache entry to get its PartitionKey and its
- * PartitionDesc. It's the caller's responsibility to lock the leaf partitions
- * that will actually be accessed during a given query.
+ * It is assumed that the caller has locked at least all the partitioned tables
+ * in the tree, because we need to look at their relcache entries.
  */
 void
-RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
-                                                                List 
**ptinfos, List **leaf_part_oids)
+RelationGetPartitionDispatchInfo(Relation rel, List **ptinfos,
+                                                                List 
**leaf_part_oids)
 {
        List       *all_parts,
                           *all_parents;
@@ -1025,16 +1023,10 @@ RelationGetPartitionDispatchInfo(Relation rel, int 
lockmode,
         * Starting with the root partitioned table for which we already have 
the
         * relcache entry, we look at its partition descriptor to get the
         * partition OIDs.  For partitions that are themselves partitioned 
tables,
-        * we get their relcache entries after locking them with lockmode and
-        * queue their partitions to be looked at later.  Leaf partitions are
-        * added to the result list without locking.  For each partitioned 
table,
-        * we build a PartitionedTableInfo object and add it to the other result
-        * list.
-        *
-        * Since RelationBuildPartitionDescriptor() puts partitions in a 
canonical
-        * order determined by comparing partition bounds, we can rely that
-        * concurrent backends see the partitions in the same order, ensuring 
that
-        * there are no deadlocks when locking the partitions.
+        * we get their relcache entries and queue their partitions to be 
looked at
+        * later.  For each leaf partition, we simply add its OID to the result
+        * list and for each partitioned table, we build a PartitionedTableInfo
+        * object and add it to the other result list.
         */
        i = offset = 0;
        *ptinfos = *leaf_part_oids = NIL;
@@ -1057,8 +1049,14 @@ RelationGetPartitionDispatchInfo(Relation rel, int 
lockmode,
                        PartitionedTableInfo   *ptinfo;
                        PartitionDispatch               pd;
 
+                       /*
+                        * All the relations in the partition tree must be locked
+                        * by the caller.
+                        *
+                        * XXX - Add RelationLockHeldByMe(partrelid) check here!
+                        */
                        if (partrelid != RelationGetRelid(rel))
-                               partrel = heap_open(partrelid, lockmode);
+                               partrel = heap_open(partrelid, NoLock);
                        else
                                partrel = rel;
 
@@ -1077,7 +1075,8 @@ RelationGetPartitionDispatchInfo(Relation rel, int 
lockmode,
                        /*
                         * XXX- do we need a pinning mechanism for partition 
descriptors
                         * so that there references can be managed 
independently of
-                        * the parent relcache entry? Like 
PinPartitionDesc(partdesc)?
+                        * the fate of the parent relcache entry?
+                        * Like PinPartitionDesc(partdesc)?
                         */
                        pd->partdesc = partdesc;
 
@@ -1141,10 +1140,9 @@ RelationGetPartitionDispatchInfo(Relation rel, int 
lockmode,
 
                        /*
                         * Release the relation descriptor.  Lock that we have 
on the
-                        * table will keep the PartitionDesc that is pointing 
into
-                        * RelationData intact, a pointer to which hope to keep
-                        * through this transaction's commit.
-                        * (XXX - how true is that?)
+                        * table will keep PartitionDesc (that is pointing into
+                        * RelationData) intact, a reference to which we want to keep
+                        * through this transaction's commit. (XXX - how true is that?)
                         */
                        if (partrel != rel)
                                heap_close(partrel, NoLock);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0379e489d9..3dd620fc8a 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -43,6 +43,7 @@
 #include "access/xact.h"
 #include "catalog/namespace.h"
 #include "catalog/partition.h"
+#include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_publication.h"
 #include "commands/matview.h"
 #include "commands/trigger.h"
@@ -3250,14 +3251,16 @@ ExecSetupPartitionTupleRouting(Relation rel,
        int                     i;
        ResultRelInfo *leaf_part_rri;
        Relation        parent;
+       List       *all_parts;
 
        /*
-        * Get information about the partition tree.  All the partitioned
-        * tables in the tree are locked, but not the leaf partitions.  We
-        * lock them while building their ResultRelInfos below.
+        * Get information about the partition tree.  First lock all the
+        * partitions using find_all_inheritors().
         */
-       RelationGetPartitionDispatchInfo(rel, RowExclusiveLock,
-                                                                        
&ptinfos, &leaf_parts);
+       all_parts = find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock,
+                                                                       NULL, 
NULL);
+       list_free(all_parts);
+       RelationGetPartitionDispatchInfo(rel, &ptinfos, &leaf_parts);
 
        /*
         * The ptinfos list contains PartitionedTableInfo objects for all the
@@ -3396,11 +3399,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
                TupleDesc       part_tupdesc;
 
                /*
-                * RelationGetPartitionDispatchInfo didn't lock the leaf 
partitions,
-                * so lock here.  Note that each of the relations in 
*partitions are
-                * eventually closed (when the plan is shut down, for instance).
+                * Note that each of the relations in *partitions is eventually
+                * closed (when the plan is shut down, for instance).
                 */
-               partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
+               partrel = heap_open(lfirst_oid(cell), NoLock);  /* already locked */
                part_tupdesc = RelationGetDescr(partrel);
 
                /*
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 6a0c81b3bd..9e63020c82 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -72,8 +72,8 @@ extern List *map_partition_varattnos(List *expr, int 
target_varno,
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
-extern void RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
-                                                                List **ptinfos, List **leaf_part_oids);
+extern void RelationGetPartitionDispatchInfo(Relation rel, List **ptinfos,
+                                                                List **leaf_part_oids);
 
 /* For tuple routing */
 extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
-- 
2.11.0
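
The locking protocol this patch adopts (lock the whole partition tree up front, then open each relation with NoLock) can be sketched as a toy model. This is only an illustration under stated assumptions: HeldLocks, lock_held_by_me, lock_all_relations and open_relation_nolock are hypothetical names, and nothing here is the PostgreSQL lock manager. lock_held_by_me stands in for the RelationLockHeldByMe(Oid) check proposed upthread.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

#define MAX_HELD_LOCKS 64

/* Toy stand-in for per-backend lock bookkeeping (not the real lock manager). */
typedef struct HeldLocks
{
	Oid			relids[MAX_HELD_LOCKS];
	int			nheld;
} HeldLocks;

/* Hypothetical analogue of the proposed RelationLockHeldByMe(Oid). */
static bool
lock_held_by_me(const HeldLocks *held, Oid relid)
{
	int			i;

	for (i = 0; i < held->nheld; i++)
	{
		if (held->relids[i] == relid)
			return true;
	}
	return false;
}

/* Step 1: lock every relation in the tree up front, in list order. */
static bool
lock_all_relations(HeldLocks *held, const Oid *relids, int nrelids)
{
	int			i;

	assert(nrelids >= 0);
	for (i = 0; i < nrelids; i++)
	{
		if (lock_held_by_me(held, relids[i]))
			continue;			/* already locked, nothing to do */
		if (held->nheld >= MAX_HELD_LOCKS)
			return false;		/* toy table is full */
		held->relids[held->nheld++] = relids[i];
	}
	return true;
}

/* Step 2: a NoLock open is only legitimate if the lock is already held. */
static bool
open_relation_nolock(const HeldLocks *held, Oid relid)
{
	return lock_held_by_me(held, relid);
}
```

Calling lock_all_relations() once and then open_relation_nolock() for each leaf mirrors the order of operations in ExecSetupPartitionTupleRouting after this patch; an open of a relation that was never locked is rejected instead of silently proceeding.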

From f09d3f00861b47fcb36e20f43be0b718e3350ab5 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 4/5] Teach expand_inherited_rtentry to use partition bound
 order

After locking the child tables using find_all_inheritors, discard the
list of child table OIDs that it returns and rebuild it in partition
bound order from the information returned by
RelationGetPartitionDispatchInfo.
---
 src/backend/optimizer/prep/prepunion.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 09e45c2982..71a0daa1b0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -1446,6 +1447,37 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
         */
        oldrelation = heap_open(parentOID, NoLock);
 
+       /*
+        * For partitioned tables, we arrange the child table OIDs such that they
+        * appear in the partition bound order.
+        */
+       if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+       {
+               List    *ptinfos,
+                               *leaf_part_oids;
+
+               /* Discard the original list. */
+               list_free(inhOIDs);
+               inhOIDs = NIL;
+
+               /* Request partitioning information. */
+               RelationGetPartitionDispatchInfo(oldrelation, &ptinfos,
+                                                                               &leaf_part_oids);
+               /*
+                * First collect the partitioned child table OIDs, which include the
+                * root parent at the head.
+                */
+               foreach(l, ptinfos)
+               {
+                       PartitionedTableInfo *ptinfo = lfirst(l);
+
+                       inhOIDs = lappend_oid(inhOIDs, ptinfo->relid);
+               }
+
+               /* Concatenate the leaf partition OIDs. */
+               inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+       }
+
        /* Scan the inheritance set and expand it */
        appinfos = NIL;
        need_append = false;
-- 
2.11.0
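
The list rebuild in expand_inherited_rtentry above amounts to "partitioned tables first (root at the head), then the leaf partitions". Below is a minimal standalone sketch of that ordering using plain arrays instead of PostgreSQL's List API; concat_partition_oids is a hypothetical helper name, and both input arrays are assumed to be in bound order already, as RelationGetPartitionDispatchInfo is expected to return them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;

/*
 * Return a malloc'd array holding parted[] followed by leaves[], and store
 * the total length in *ntotal.  Returns NULL on allocation failure.  The
 * caller frees the result.
 */
static Oid *
concat_partition_oids(const Oid *parted, int nparted,
					  const Oid *leaves, int nleaves,
					  int *ntotal)
{
	Oid		   *result;

	/* The root parent is always present at the head of parted[]. */
	assert(nparted >= 1);
	assert(nleaves >= 0);

	result = malloc((size_t) (nparted + nleaves) * sizeof(Oid));
	if (result == NULL)
		return NULL;

	/* Partitioned tables come first ... */
	memcpy(result, parted, (size_t) nparted * sizeof(Oid));

	/* ... followed by the leaf partitions. */
	if (nleaves > 0)
		memcpy(result + nparted, leaves, (size_t) nleaves * sizeof(Oid));

	*ntotal = nparted + nleaves;
	return result;
}
```

With parted = {100, 103} and leaves = {101, 102, 104}, the result is {100, 103, 101, 102, 104}: every partitioned table precedes every leaf, which is the property the planner-side list needs.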

From 704b0877170757deae269b6bababbb2487693a4b Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 9 Aug 2017 16:53:47 +0900
Subject: [PATCH 5/5] Store in pg_inherits if a child is a partitioned table

---
 doc/src/sgml/catalogs.sgml        | 10 ++++++++++
 src/backend/catalog/pg_inherits.c | 14 +++++++-------
 src/backend/commands/tablecmds.c  | 17 +++++++++++------
 src/include/catalog/pg_inherits.h |  4 +++-
 4 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97e5ecf686..eae9b77ccb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3896,6 +3896,16 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</>:<replaceable>&lt;salt&gt;<
        inherited columns are to be arranged.  The count starts at 1.
       </entry>
      </row>
+
+     <row>
+      <entry><structfield>inhchildparted</structfield></entry>
+      <entry><type>bool</type></entry>
+      <entry></entry>
+      <entry>
+       This is <literal>true</> if the child table is a partitioned table,
+       <literal>false</> otherwise.
+      </entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 99b1e70de6..0285bc3c33 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -30,7 +30,6 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/memutils.h"
-#include "utils/lsyscache.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
@@ -123,7 +122,12 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
 
        while ((inheritsTuple = systable_getnext(scan)) != NULL)
        {
+               bool    is_partitioned;
+
                inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
+               is_partitioned = ((Form_pg_inherits)
+                                                               GETSTRUCT(inheritsTuple))->inhchildparted;
+
                if (numchildren >= maxchildren)
                {
                        maxchildren *= 2;
@@ -131,14 +135,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
                                                                                maxchildren * sizeof(InhChildInfo));
                }
                inhchildren[numchildren].relid = inhrelid;
+               inhchildren[numchildren].is_partitioned = is_partitioned;
 
-               if (get_rel_relkind(inhrelid) == RELKIND_PARTITIONED_TABLE)
-               {
-                       inhchildren[numchildren].is_partitioned = true;
+               if (is_partitioned)
                        my_num_partitioned_children++;
-               }
-               else
-                       inhchildren[numchildren].is_partitioned = false;
                numchildren++;
        }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 14bac087d9..ab3cbbcdba 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -299,10 +299,10 @@ static bool MergeCheckConstraint(List *constraints, char *name, Node *expr);
 static void MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel);
 static void MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel);
 static void StoreCatalogInheritance(Oid relationId, List *supers,
-                                               bool child_is_partition);
+                                               bool child_is_partition, bool child_is_partitioned);
 static void StoreCatalogInheritance1(Oid relationId, Oid parentOid,
                                                 int16 seqNumber, Relation inhRelation,
-                                                bool child_is_partition);
+                                                bool child_is_partition, bool child_is_partitioned);
 static int     findAttrByName(const char *attributeName, List *schema);
 static void AlterIndexNamespaces(Relation classRel, Relation rel,
                                         Oid oldNspOid, Oid newNspOid, ObjectAddresses *objsMoved);
@@ -746,7 +746,8 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
                                                                                  typaddress);
 
        /* Store inheritance information for new rel. */
-       StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL);
+       StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL,
+                                                       relkind == RELKIND_PARTITIONED_TABLE);
 
        /*
         * We must bump the command counter to make the newly-created relation
@@ -2298,7 +2299,7 @@ MergeCheckConstraint(List *constraints, char *name, Node *expr)
  */
 static void
 StoreCatalogInheritance(Oid relationId, List *supers,
-                                               bool child_is_partition)
+                                               bool child_is_partition, bool child_is_partitioned)
 {
        Relation        relation;
        int16           seqNumber;
@@ -2329,7 +2330,7 @@ StoreCatalogInheritance(Oid relationId, List *supers,
                Oid                     parentOid = lfirst_oid(entry);
 
                StoreCatalogInheritance1(relationId, parentOid, seqNumber, relation,
-                                                                child_is_partition);
+                                                                child_is_partition, child_is_partitioned);
                seqNumber++;
        }
 
@@ -2343,7 +2344,7 @@ StoreCatalogInheritance(Oid relationId, List *supers,
 static void
 StoreCatalogInheritance1(Oid relationId, Oid parentOid,
                                                 int16 seqNumber, Relation inhRelation,
-                                                bool child_is_partition)
+                                                bool child_is_partition, bool child_is_partitioned)
 {
        TupleDesc       desc = RelationGetDescr(inhRelation);
        Datum           values[Natts_pg_inherits];
@@ -2358,6 +2359,8 @@ StoreCatalogInheritance1(Oid relationId, Oid parentOid,
        values[Anum_pg_inherits_inhrelid - 1] = ObjectIdGetDatum(relationId);
        values[Anum_pg_inherits_inhparent - 1] = ObjectIdGetDatum(parentOid);
        values[Anum_pg_inherits_inhseqno - 1] = Int16GetDatum(seqNumber);
+       values[Anum_pg_inherits_inhchildparted - 1] =
+                                                                       BoolGetDatum(child_is_partitioned);
 
        memset(nulls, 0, sizeof(nulls));
 
@@ -11112,6 +11115,8 @@ CreateInheritance(Relation child_rel, Relation parent_rel)
                                                         inhseqno + 1,
                                                         catalogRelation,
                                                         parent_rel->rd_rel->relkind ==
+                                                        RELKIND_PARTITIONED_TABLE,
+                                                        child_rel->rd_rel->relkind ==
                                                         RELKIND_PARTITIONED_TABLE);
 
        /* Now we're done with pg_inherits */
diff --git a/src/include/catalog/pg_inherits.h b/src/include/catalog/pg_inherits.h
index 26bfab5db6..2c4ef246a4 100644
--- a/src/include/catalog/pg_inherits.h
+++ b/src/include/catalog/pg_inherits.h
@@ -33,6 +33,7 @@ CATALOG(pg_inherits,2611) BKI_WITHOUT_OIDS
        Oid                     inhrelid;
        Oid                     inhparent;
        int32           inhseqno;
+       bool            inhchildparted;
 } FormData_pg_inherits;
 
 /* ----------------
@@ -46,10 +47,11 @@ typedef FormData_pg_inherits *Form_pg_inherits;
  *             compiler constants for pg_inherits
  * ----------------
  */
-#define Natts_pg_inherits                              3
+#define Natts_pg_inherits                              4
 #define Anum_pg_inherits_inhrelid              1
 #define Anum_pg_inherits_inhparent             2
 #define Anum_pg_inherits_inhseqno              3
+#define Anum_pg_inherits_inhchildparted        4
 
 /* ----------------
  *             pg_inherits has no initial contents
-- 
2.11.0
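
With inhchildparted available from the pg_inherits tuple, the ordering proposed upthread (qsort() first by the flag, then by OID, and report how many leading entries are partitioned) could look roughly like the sketch below. This is a standalone illustration, not code from the patch set: the struct mirrors the relid and is_partitioned fields of InhChildInfo used above, while inhchildinfo_cmp and sort_inh_children are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int Oid;

/* Mirrors the two InhChildInfo fields used by find_inheritance_children. */
typedef struct InhChildInfo
{
	Oid			relid;
	bool		is_partitioned;
} InhChildInfo;

/* qsort comparator: partitioned children first, then ascending OID. */
static int
inhchildinfo_cmp(const void *p1, const void *p2)
{
	const InhChildInfo *a = p1;
	const InhChildInfo *b = p2;

	if (a->is_partitioned != b->is_partitioned)
		return a->is_partitioned ? -1 : 1;
	if (a->relid < b->relid)
		return -1;
	if (a->relid > b->relid)
		return 1;
	return 0;
}

/*
 * Sort children[] in place and return the number of partitioned children,
 * which occupy the first entries of the array after sorting.
 */
static int
sort_inh_children(InhChildInfo *children, int nchildren)
{
	int			num_partitioned = 0;
	int			i;

	assert(nchildren >= 0);
	for (i = 0; i < nchildren; i++)
	{
		if (children[i].is_partitioned)
			num_partitioned++;
	}

	qsort(children, (size_t) nchildren, sizeof(InhChildInfo),
		  inhchildinfo_cmp);
	return num_partitioned;
}
```

Sorting by OID within each group keeps the lock acquisition order deterministic across backends, while the returned count lets a caller lock only the partitioned prefix first and defer the leaf partitions.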

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
