On 5 July 2017 at 15:12, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Like I mentioned upthread... in expand_inherited_rtentry(), if we
> replace find_all_inheritors() with something else that returns oids in
> canonical order, that will change the order in which children tables
> get locked, which increases the chance of deadlock. Because, then the
> callers of find_all_inheritors() will lock them in one order, while
> callers of expand_inherited_rtentry() will lock them in a different
> order. Even in the current code, I think there is a chance of
> deadlocks because RelationGetPartitionDispatchInfo() and
> find_all_inheritors() have different lock ordering.
> Now, to get the oids of a partitioned table children sorted by
> canonical ordering, (i.e. using the partition bound values) we need to
> either use the partition bounds to sort the oids like the way it is
> done in RelationBuildPartitionDesc() or, open the parent table and get
> it's Relation->rd_partdesc->oids[] which are already sorted in
> canonical order. So if we generate oids using this way in
> find_all_inheritors() and find_inheritance_children(), that will
> generate consistent ordering everywhere. But this method is quite
> expensive as compared to the way oids are generated and sorted using
> oid values in find_inheritance_children().
> In both expand_inherited_rtentry() and
> RelationGetPartitionDispatchInfo(), each of the child tables are
> opened.
> So, in both of these functions, what we can do is : call a new
> function partition_tree_walker() which does following :
> 1. Lock the children using the existing order (i.e. sorted by oid
> values) using the same function find_all_inheritors(). Rename
> find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
> which returns the oid list only if requested.
> 2. And then scan through each of the partitions in canonical order, by
> opening the parent table, then opening the partition descriptor oids,
> and then doing whatever needs to be done with that partition rel.
> partition_tree_walker() will look something like this :
> void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
>                        void (*walker_func) (), void *context)
> {
>     Relation parentrel;
>     List *rels_list;
>     ListCell *cell;
>     (void) lock_all_inheritors(parentOid, lockmode,
>                            false /* don't generate oids */);
>     parentrel = heap_open(parentOid, NoLock);
>     rels_list = append_rel_partition_oids(NIL, parentrel);
>     /* Scan through all partitioned rels, and at the
>      * same time append their children. */
>     foreach(cell, rels_list)
>     {
>         /* Open partrel without locking; lock_all_inheritors() has locked it 
> */
>         Relation    partrel = heap_open(lfirst_oid(cell), NoLock);
>         /* Append the children of a partitioned rel to the same list
>          * that we are iterating on */
>         if (RelationGetPartitionDesc(partrel))
>             rels_list = append_rel_partition_oids(rels_list, partrel);
>         /*
>          * Do whatever processing needs to be done on this partel.
>          * The walker function is free to either close the partel
>          * or keep it opened, but it needs to make sure the opened
>          * ones are closed later
>          */
>         walker_func(partrel, context);
>     }
> }
> List *append_rel_partition_oids(List *rel_list, Relation rel)
> {
>     int i;
>     for (i = 0; i < rel->rd_partdesc->nparts; i++)
>         rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);
>     return rel_list;
> }
> So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
> replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
> where expand_rte_walker_func() will do all the work done in the for
> loop for each of the partition rels.
> Similarly, in RelationGetPartitionDispatchInfo() the initial part
> where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
> partition_tree_walker(rel, dispatch_info_walkerfunc) where
> dispatch_info_walkerfunc() will generate the oids, or may be populate
> the complete PartitionDispatchData structure. 'pd' variable can be
> passed as context to the partition_tree_walker(..., context)
> Generating the resultrels in canonical order by opening the tables
> using the above way wouldn't be more expensive than the existing code,
> because even currently we anyways have to open all the tables in both
> of these functions.

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

In this patch, rather than partition_tree_walker() called with a
context, I have provided a function partition_walker_next() using
which we iterate over all the partitions in canonical order.
partition_walker_next() will take care of appending oids from
partition descriptors.

Now, to generate consistent oid ordering in
RelationGetPartitionDispatchInfo() and expand_inherited_rtentry(), we
could have very well skipped using the partition_walker API in
expand_inherited_rtentry() and just had it iterate over the partition
descriptors the way it is done in  RelationGetPartitionDispatchInfo().
But I think it's better to have some common function to traverse the
partition tree in consistent order, hence the usage of
partition_walker_next() in both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(). In
RelationGetPartitionDispatchInfo(), still, it only uses this function
to generate partitioned table list. But even to generate partitioned
tables in correct order, it is better to use partition_walker_next(),
so that we make sure to finally generate consistent order of leaf

I considered the option where RelationGetPartitionDispatchInfo() would
directly build the pd[] array over each iteration of
partition_walker_next(). But that was turning out to be clumsy,
because then we need to keep track of which pd[] element each of the
oids would go into by having a current position of pd[]. Rather than
this, it is best to keep building of pd array separate, as done in the
existing code.

Didn't do any renaming for find_all_inheritors(). Just called it in
both the functions, and ignored the list returned. Like mentioned
upthread, it is important to lock in this order  so as to be
consistent with the lock ordering in other places where
find_inheritance_children() is called. Hence, called
find_all_inheritors() in RelationGetPartitionDispatchInfo() as well.

Note that this patch does not attempt to make
RelationGetPartitionDispatchInfo() work in planner. That I think
should be done once we finalise how to generate common oid ordering,
and is not in the scope of this project.

Once I merge this in the update-partition-key patch, in
ExecSetupPartitionTupleRouting(), I will be able to search for the
leaf partitions in this ordered resultrel list, without having to
build a hash table of result rels the way it is currently done in the
update-partition-key patch.

-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment: make_resultrels_ordered.patch
Description: Binary data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to