Re: [HACKERS] path toward faster partition pruning

Amit Langote Wed, 30 Aug 2017 20:18:40 -0700

Hi Ashutosh,

Thanks for the comments and sorry that it took me a while to reply here.


On 2017/08/23 20:16, Ashutosh Bapat wrote:
> On Mon, Aug 21, 2017 at 12:07 PM, Amit Langote
> <[email protected]> wrote:
>> I've been working on implementing a way to perform plan-time
>> partition-pruning that is hopefully faster than the current method of
>> using constraint exclusion to prune each of the potentially many
>> partitions one-by-one.  It's not fully cooked yet though.
>>
>> Meanwhile, I thought I'd share a couple of patches that implement some
>> restructuring of the planner code related to partitioned table inheritance
>> planning that I think would be helpful.  They are to be applied on top of
>> the patches being discussed at [1].  Note that these patches themselves
>> don't implement the actual code that replaces constraint exclusion as a
>> method of performing partition pruning.  I will share that patch after
>> debugging it some more.
>>
>> The main design goal of the patches I'm sharing here now is to defer the
>> locking and  opening of leaf partitions in a given partition tree to a
>> point after set_append_rel_size() is called on the root partitioned table.
>>  Currently, AFAICS, we need to lock and open the child tables in
>> expand_inherited_rtentry() only to set the translated_vars field in
>> AppendRelInfo that we create for the child.  ISTM, we can defer the
>> creation of a child AppendRelInfo to a point when it (its field
>> translated_vars in particular) will actually be used and so lock and open
>> the child tables only at such a time.  Although we don't lock and open the
>> partition child tables in expand_inherited_rtentry(), their RT entries are
>> still created and added to root->parse->rtable, so that
>> setup_simple_rel_arrays() knows the maximum number of entries
>> root->simple_rel_array will need to hold and allocate the memory for that
>> array accordingly.   Slots in simple_rel_array[] corresponding to
>> partition child tables will be empty until they are created when
>> set_append_rel_size() is called on the root parent table and it determines
>> the partitions that will be scanned after all.
> 
> The partition pruning can happen only after the quals have been
> distributed to Rels i.e. after deconstruct_jointree(),
> reconsider_outer_join_clauses() and generate_base_implied_equalities()
> have been called. If the goal is to not heap_open() the partitions
> which are pruned, we can't do that in expand_inherited_rtentry(). One
> reason why I think we don't want to heap_open() partition relations is
> to avoid relcache bloat because of opened partition relations, which
> are ultimately pruned. But please note that according to your patches,
> we still need to populate catalog caches to get relkind and reltype
> etc.

Yes, we still hit syscache for *all* partitions.  I haven't yet thought
very hard about avoiding that altogether.

> There are many functions that traverse simple_rel_array[] after it's
> created. Most of them assume that the empty entries in that array
> correspond to non-simple range entries like join RTEs. But now we are
> breaking that assumption. Most of these functions also skip "other"
> relations, so that may be OK now, but I am not sure if it's really
> going to be fine if we keep empty slots in place of partition
> relations. There may be three options here 1. add placeholder
> RelOptInfos for partition relations (may be mark those specially) and
> mark the ones which get pruned as dummy later. 2. Prune the partitions
> before any functions scans simple_rel_array[] or postpone creating
> simple_rel_array till pruning. 3. Examine all the current scanners
> esp. the ones which will be called before pruning to make sure that
> skipping "other" relations is going to be kosher.

Between the point when slots in simple_rel_array are allocated
(setup_simple_rel_arrays) and partition RelOptInfos are actually created
after the partition-pruning step has occurred (set_append_rel_size), it
seems that most places that iterate over simple_rel_array know also to
skip slots containing NULL values.  We might need to document that NULL
means partitions in addition to its current meaning - non-baserels.

>> Patch augments the existing PartitionedChildRelInfo node, which currently
>> holds only the partitioned child rel RT indexes, to carry some more
>> information about the partition tree, which includes the information
>> returned by RelationGetPartitionDispatchInfo() when it is called from
>> expand_inherited_rtentry() (per the proposed patch in [1], we call it to
>> be able to add partitions to the query tree in the bound order).
>> Actually, since PartitionedChildRelInfo now contains more information
>> about the partition tree than it used to before, I thought the struct's
>> name is no longer relevant, so renamed it to PartitionRootInfo and renamed
>> root->pcinfo_list accordingly to prinfo_list.  That seems okay because we
>> only use that node internally.
>>
>> Then during the add_base_rels_to_query() step, when build_simple_rel()
>> builds a RelOptInfo for the root partitioned table, it also initializes
>> some newly introduced fields in RelOptInfo from the information contained
>> in PartitionRootInfo of the table.  The aforementioned fields are only
>> initialized in RelOptInfos of root partitioned tables.  Note that the
>> add_base_rels_to_query() step won't add the partition "otherrel"
>> RelOptInfos yet (unlike the regular inheritance case, where they are,
>> after looking them up in root->append_rel_list).
> 
> Partition-wise join requires the partition hierarchy to be expanded
> level-by-level keeping in-tact the parent-child relationship between
> partitioned table and its partitions. Your patch doesn't do that and
> adds all the partitioning information in the root partitioned table's
> RelOptInfo. OTOH, partition-wise join patch adds partition bounds, key
> expressions, OID and RelOptInfos of the immediate partitions
> (including partitioned partitions) to RelOptInfo of a partitioned
> table (see patch 0002 in the latest set of patches at [1]). I don't
> see much point in having conflicting changes in both of our patches.
> May be you should review that patch from my set  and we can find a set
> of members which help both partition pruning and partition-wise join.

Yes, I think it would be a good idea for the partition-pruning patch to
initialize those fields in the individual parents' RelOptInfos.  I will
review relevant patches in the partitionwise-join thread to see what can
be incorporated here.

>> When set_append_rel_size() is called on the root partitioned table, it
>> will call a find_partitions_for_query(), which using the partition tree
>> information, determines the partitions that will need to be scanned for
>> the query.  This processing happens recursively, that is, we first
>> determine the root-parent's partitions and then for each partition that's
>> partitioned, we will determine its partitions and so on.  As we determine
>> partitions in this per-partitioned-table manner, we maintain a pair
>> (parent_relid, list-of-partition-relids-to-scan) for each partitioned
>> table and also a single list of all leaf partitions determined so far.
>> Once all partitions have been determined, we turn to locking the leaf
>> partitions.  The locking happens in the order of OIDs as
>> find_all_inheritors would have returned in expand_inherited_rtentry(); the
>> list of OIDs in that original order is also stored in the table's
>> PartitionRootInfo node.  For each OID in that list, check if that OID is
>> in the set of leaf partition OIDs that was just computed, and if so, lock
>> it.  For all chosen partitions that are partitioned tables (including the
>> root), we create a PartitionAppendInfo node which stores the
>> aforementioned pair (parent_relid, list-of-partitions-relids-to-scan), and
>> append it to a list in the root table's RelOptInfo, with the root table's
>> PartitionAppendInfo at the head of the list.  Note that the list of
>> partitions in this pair contains only the immediate partitions, so that
>> the original parent-child relationship is reflected in the list of
>> PartitionAppendInfos thus collected.  The next patch that will implement
>> actual partition-pruning will add some more code that will run under
>> find_partitions_for_query().
>>
>> set_append_rel_size() processing then continues for the root partitioned
>> table.  It is at this point that we will create the RelOptInfos and
>> AppendRelInfos for partitions.  First for those of the root partitioned
>> table and then for those of each partitioned table when
>> set_append_rel_size() will be recursively called for the latter.
> 
> set_append_rel_size(), set_append_rel_pathlist() are already
> recursive, so if we process expansion and pruning for one level in
> those functions, the recursion will automatically take care of doing
> so for every level.

My only worry about that is locking order of leaf partitions will be
different from concurrent backends if we lock them in an order dictated by
traversing the partition tree level at a time.  Because such traversal
will presumably proceed in the partition bound order.

The patch computes *all* leaf partitions that will need to be scanned by
the query (after pruning needless ones) and lock the chosen partitions in
the original order (it keeps the original order OID list generated by
find_all_inheritors around for this purpose).  While computing the leaf
partitions, it remembers immediate parent-child relationships in the
process.  The way it does it is by processing the partition tree in a
recursive depth-first manner, and in each recursive step, creating a
PartitionAppendInfo that maps a parent table to its immediate children
(only those that will satisfy the query).  Once the PartitionAppendInfo's
for all the parents in the partition tree have been computed, we resume
the set_append_rel_size() processing, which takes the root parent's
PartitionAppendInfo and builds RelOptInfos and AppendRelInfos for its
immediate children.  Its children that are partitioned tables themselves
will recursively call set_append_rel_size() and look up its own
PartitionAppendInfo and create RelOptInfos and AppendRelInfos for its
children and so on.

>> Note that this is still largely a WIP patch and the implementation details
>> might change per both the feedback here and the discussion at [1].
> 
> The changes to code which handle expansion in this patch set should
> really be part of expansion in bound order thread so that it's easy to
> review all changes together. And this thread can then only concentrate
> on partition pruning.

I think I agree.  I'm posting today the patches that actually implement
partition-pruning.  The previous patches do seem to belong on the EIBO
thread, but will post them together here today for convenience of being
able to apply them to HEAD and try out.  Now that Robert has posted a
patch to implement depth-first EIBO, I will have to find a way to rebase
the actual partition-pruning patches on this thread so that its core logic
works at all by finding the information it needs.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

Reply via email to