On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote: > The current way to expand inherited tables, including partitioned tables, > is to use either find_all_inheritors() or find_inheritance_children() > depending on the context. They return child table OIDs in the (ascending) > order of those OIDs, which means the callers that need to lock the child > tables can do so without worrying about the possibility of deadlock in > some concurrent execution of that piece of code. That's good. > > For partitioned tables, there is a possibility of returning child table > (partition) OIDs in the partition bound order, which in addition to > preventing the possibility of deadlocks during concurrent locking, seems > potentially useful for other caller-specific optimizations. For example, > tuple-routing code can utilize that fact to implement binary-search based > partition-searching algorithm. For one more example, refer to the "UPDATE > partition key" thread where it's becoming clear that it would be nice if > the planner had put the partitions in bound order in the ModifyTable that > it creates for UPDATE of partitioned tables [1].
I guess I don't really understand why we want to change the locking order. That is bound to make expanding the inheritance hierarchy more expensive. If we use this approach in all cases, it seems to me we're bound to reintroduce the problem we fixed in commit c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in the same vein. But I don't see that there's any necessary relation between the order of locking and the order of expansion inside the relcache entry/plan/whatever else -- so I propose that we keep the existing locking order and only change the other stuff. While reading related code this morning, I noticed that RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have *already* changed the locking order for certain operations, because the PartitionDesc's OID array is bound-ordered not OID-ordered. That means that when RelationGetPartitionDispatchInfo uses the PartitionDesc's OID arra to figure out what to lock, it's potentially going to encounter partitions in a different order than would have been the case if it had used find_all_inheritors directly. I'm tempted to think that RelationGetPartitionDispatchInfo shouldn't really be doing locking at all. The best way to have the locking always happen in the same order is to have only one piece of code that determines that order - and I vote for find_all_inheritors. Aside from the fact that it's the way we've always done it (and still do it in most other places), that code includes infinite-loop defenses which the new code you've introduced lacks. Concretely, my proposal is: 1. Before calling RelationGetPartitionDispatchInfo, the calling code should use find_all_inheritors to lock all the relevant relations (or the planner could use find_all_inheritors to get a list of relation OIDs, store it in the plan in order, and then at execution time we visit them in that order and lock them). 2. RelationGetPartitionDispatchInfo assumes the relations are already locked. 3. While we're optimizing, in the first loop inside of RelationGetPartitionDispatchInfo, don't call heap_open(). Instead, use get_rel_relkind() to see whether we've got a partitioned table; if so, open it. If not, there's no need. 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1)) elog(ERROR, ...). Might be interesting to stuff that check into the relation_open(..., NoLock) path, too. One objection to this line of attack is that there might be a good case for locking only the partitioned inheritors first and then going back and locking the leaf nodes in a second pass, or even only when required for a particular row. However, that doesn't require putting everything in bound order - it only requires moving the partitioned children to the beginning of the list. And I think rather than having new logic for that, we should teach find_inheritance_children() to do that directly. I have a feeling Ashutosh is going to cringe at this suggestion, but my idea is to do this by denormalizing: add a column to pg_inherits indicating whether the child is of RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans pg_inherits, it can pull that flag out for free along with the relation OID, and qsort() first by the flag and then by the OID. It can also return the number of initial elements of its return value which have that flag set. Then, in find_all_inheritors, we split rels_list into partitioned_rels_list and other_rels_list, and process partitioned_rels_list in its entirety before touching other_rels_list; they get concatenated at the end. Now, find_all_inheritors and find_inheritance_children can also grow a flag bool only_partitioned_children; if set, then we skip the unpartitioned children entirely. With all that in place, you can call find_all_inheritors(blah blah, false) to lock the whole hierarchy, or find_all_inheritors(blah blah, true) to lock just the partitioned tables in the hierarchy. You get a consistent lock order either way, and if you start with only the partitioned tables and later want the leaf partitions too, you just go through the partitioned children in the order they were returned and find_inheritance_children(blah blah, false) on each one of them and the lock order is exactly consistent with what you would have gotten if you'd done find_all_inheritors(blah blah, false) originally. Thoughts? P.S. While I haven't reviewed 0002 in detail, I think the concept of minimizing what needs to be built in RelationGetPartitionDispatchInfo is a very good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers