On 5 July 2017 at 15:12, Amit Khandekar <amitdkhan...@gmail.com> wrote: > Like I mentioned upthread... in expand_inherited_rtentry(), if we > replace find_all_inheritors() with something else that returns oids in > canonical order, that will change the order in which children tables > get locked, which increases the chance of deadlock. Because, then the > callers of find_all_inheritors() will lock them in one order, while > callers of expand_inherited_rtentry() will lock them in a different > order. Even in the current code, I think there is a chance of > deadlocks because RelationGetPartitionDispatchInfo() and > find_all_inheritors() have different lock ordering. > > Now, to get the oids of a partitioned table children sorted by > canonical ordering, (i.e. using the partition bound values) we need to > either use the partition bounds to sort the oids like the way it is > done in RelationBuildPartitionDesc() or, open the parent table and get > it's Relation->rd_partdesc->oids[] which are already sorted in > canonical order. So if we generate oids using this way in > find_all_inheritors() and find_inheritance_children(), that will > generate consistent ordering everywhere. But this method is quite > expensive as compared to the way oids are generated and sorted using > oid values in find_inheritance_children(). > > In both expand_inherited_rtentry() and > RelationGetPartitionDispatchInfo(), each of the child tables are > opened. > > So, in both of these functions, what we can do is : call a new > function partition_tree_walker() which does following : > 1. Lock the children using the existing order (i.e. sorted by oid > values) using the same function find_all_inheritors(). Rename > find_all_inheritors() to lock_all_inheritors(... , bool return_oids) > which returns the oid list only if requested. > 2. And then scan through each of the partitions in canonical order, by > opening the parent table, then opening the partition descriptor oids, > and then doing whatever needs to be done with that partition rel. > > partition_tree_walker() will look something like this : > > void partition_tree_walker(Oid parentOid, LOCKMODE lockmode, > void (*walker_func) (), void *context) > { > Relation parentrel; > List *rels_list; > ListCell *cell; > > (void) lock_all_inheritors(parentOid, lockmode, > false /* don't generate oids */); > > parentrel = heap_open(parentOid, NoLock); > rels_list = append_rel_partition_oids(NIL, parentrel); > > /* Scan through all partitioned rels, and at the > * same time append their children. */ > foreach(cell, rels_list) > { > /* Open partrel without locking; lock_all_inheritors() has locked it > */ > Relation partrel = heap_open(lfirst_oid(cell), NoLock); > > /* Append the children of a partitioned rel to the same list > * that we are iterating on */ > if (RelationGetPartitionDesc(partrel)) > rels_list = append_rel_partition_oids(rels_list, partrel); > > /* > * Do whatever processing needs to be done on this partel. > * The walker function is free to either close the partel > * or keep it opened, but it needs to make sure the opened > * ones are closed later > */ > walker_func(partrel, context); > } > } > > List *append_rel_partition_oids(List *rel_list, Relation rel) > { > int i; > for (i = 0; i < rel->rd_partdesc->nparts; i++) > rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]); > > return rel_list; > } > > > So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be > replaced by partition_tree_walker(parentOid, expand_rte_walker_func) > where expand_rte_walker_func() will do all the work done in the for > loop for each of the partition rels. > > Similarly, in RelationGetPartitionDispatchInfo() the initial part > where it uses APPEND_REL_PARTITION_OIDS() can be replaced by > partition_tree_walker(rel, dispatch_info_walkerfunc) where > dispatch_info_walkerfunc() will generate the oids, or may be populate > the complete PartitionDispatchData structure. 'pd' variable can be > passed as context to the partition_tree_walker(..., context) > > Generating the resultrels in canonical order by opening the tables > using the above way wouldn't be more expensive than the existing code, > because even currently we anyways have to open all the tables in both > of these functions. >
Attached is a WIP patch (make_resultrels_ordered.patch) that generates the result rels in canonical order. This patch is kept separate from the update-partition-key patch, and can be applied on master branch. In this patch, rather than partition_tree_walker() called with a context, I have provided a function partition_walker_next() using which we iterate over all the partitions in canonical order. partition_walker_next() will take care of appending oids from partition descriptors. Now, to generate consistent oid ordering in RelationGetPartitionDispatchInfo() and expand_inherited_rtentry(), we could have very well skipped using the partition_walker API in expand_inherited_rtentry() and just had it iterate over the partition descriptors the way it is done in RelationGetPartitionDispatchInfo(). But I think it's better to have some common function to traverse the partition tree in consistent order, hence the usage of partition_walker_next() in both expand_inherited_rtentry() and RelationGetPartitionDispatchInfo(). In RelationGetPartitionDispatchInfo(), still, it only uses this function to generate partitioned table list. But even to generate partitioned tables in correct order, it is better to use partition_walker_next(), so that we make sure to finally generate consistent order of leaf oids. I considered the option where RelationGetPartitionDispatchInfo() would directly build the pd[] array over each iteration of partition_walker_next(). But that was turning out to be clumsy, because then we need to keep track of which pd[] element each of the oids would go into by having a current position of pd[]. Rather than this, it is best to keep building of pd array separate, as done in the existing code. Didn't do any renaming for find_all_inheritors(). Just called it in both the functions, and ignored the list returned. Like mentioned upthread, it is important to lock in this order so as to be consistent with the lock ordering in other places where find_inheritance_children() is called. Hence, called find_all_inheritors() in RelationGetPartitionDispatchInfo() as well. Note that this patch does not attempt to make RelationGetPartitionDispatchInfo() work in planner. That I think should be done once we finalise how to generate common oid ordering, and is not in the scope of this project. Once I merge this in the update-partition-key patch, in ExecSetupPartitionTupleRouting(), I will be able to search for the leaf partitions in this ordered resultrel list, without having to build a hash table of result rels the way it is currently done in the update-partition-key patch. Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
make_resultrels_ordered.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers