On Fri, Aug 4, 2017 at 10:55 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context.  They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code.  That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations.  For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm.  For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
>
> I guess I don't really understand why we want to change the locking
> order.  That is bound to make expanding the inheritance hierarchy more
> expensive. If we use this approach in all cases, it seems to me we're
> bound to reintroduce the problem we fixed in commit
> c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
> the same vein.

I initially didn't understand this, but I think now I understand it.
Establishing the order of children by partition bounds requires
building the relcache entry right now. That's what is expensive and
would introduce the same problems as the commit you have quoted.

> But I don't see that there's any necessary relation
> between the order of locking and the order of expansion inside the
> relcache entry/plan/whatever else -- so I propose that we keep the
> existing locking order and only change the other stuff.
>
> While reading related code this morning, I noticed that
> RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
> *already* changed the locking order for certain operations, because
> the PartitionDesc's OID array is bound-ordered not OID-ordered.  That
> means that when RelationGetPartitionDispatchInfo uses the
> PartitionDesc's OID arra to figure out what to lock, it's potentially
> going to encounter partitions in a different order than would have
> been the case if it had used find_all_inheritors directly.  I'm
> tempted to think that RelationGetPartitionDispatchInfo shouldn't
> really be doing locking at all.  The best way to have the locking
> always happen in the same order is to have only one piece of code that
> determines that order - and I vote for find_all_inheritors.  Aside
> from the fact that it's the way we've always done it (and still do it
> in most other places), that code includes infinite-loop defenses which
> the new code you've introduced lacks.
>

+1.

> Concretely, my proposal is:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
>
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.
>
> 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...).  Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
>
> One objection to this line of attack is that there might be a good
> case for locking only the partitioned inheritors first and then going
> back and locking the leaf nodes in a second pass, or even only when
> required for a particular row.  However, that doesn't require putting
> everything in bound order - it only requires moving the partitioned
> children to the beginning of the list.  And I think rather than having
> new logic for that, we should teach find_inheritance_children() to do
> that directly.  I have a feeling Ashutosh is going to cringe at this
> suggestion, but my idea is to do this by denormalizing: add a column
> to pg_inherits indicating whether the child is of
> RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
> pg_inherits, it can pull that flag out for free along with the
> relation OID, and qsort() first by the flag and then by the OID.  It
> can also return the number of initial elements of its return value
> which have that flag set.

I am always uncomfortable, when we save the same information in two
places without having a way to make sure that they are in sync. That
means we have to add explicit code to make sure that that information
is kept in sync. Somebody forgetting to add that code wherever
necessary means we have contradictory information persisted in the
databases without an idea of which of them is correct. Not necessarily
in this case, but usually it is an indication of something going wrong
with the way schema is designed. May be it's better to use your idea
of using get_rel_relkind() or find a way to check that the flag is in
sync with the relkind, like when building the relcache.

>
> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
>
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
>
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy.  You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.
>
> Thoughts?

I noticed that find_all_inheritors() uses a hash table to eliminate
duplicates arising out of multiple inheritance. Partition hierarchy is
never going to have multiple inheritance, and doesn't need to
eliminate duplicates and so doesn't need the hash table. It will be
good, if we can eliminate that overhead. But that's separate task than
what this thread is about.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to