On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context.  They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code.  That's good.
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations.  For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm.  For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

I guess I don't really understand why we want to change the locking
order.  That is bound to make expanding the inheritance hierarchy more
expensive.  If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein.  But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.

While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered.  That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID arra to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly.  I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all.  The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors.  Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.

Concretely, my proposal is:

1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).

2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it.  If not, there's no need.

4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
elog(ERROR, ...).  Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.

One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row.  However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list.  And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly.  I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID.  It
can also return the number of initial elements of its return value
which have that flag set.

Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.

Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.

With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy.  You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.


P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to