Sorry forgot to mention: this patch applies on top of the v7 patches posted
by Amit Langote on 27th June (

On Tue, Jul 19, 2016 at 7:41 PM, Ashutosh Bapat <> wrote:

> On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <>
> wrote:
>> I haven't reviewed this code yet due to being busy with 9.6, but I
>> think this is a very important query planner improvement with the
>> potential for big wins on queries involving large amounts of data.
>> Suppose we have a pair of equi-partitioned tables.  Right now, if we
>> choose to perform a hash join, we'll have to build a giant hash table
>> with all of the rows from every inner partition and then probe it for
>> every row in every outer partition.  If there are few enough inner
>> rows that the resultant hash table still fits in work_mem, this is
>> somewhat inefficient but not terrible - but if it causes us to have to
>> batch the hash join where we otherwise would not need to do so, then
>> it really sucks.  Similarly, if we decide to merge-join each pair of
>> partitions, a partitionwise join may be able to use an internal sort
>> on some or all partitions whereas if we had to deal with all of the
>> data at the same time we'd need an external sort, possibly multi-pass.
> Or we might be able to use indexes directly without need of a MergeAppend.
>>   And if we choose a nested loop, say over an inner index-scan, we do
>> O(outer rows) index probes with this optimization but O(outer rows *
>> inner partitions) index probes without it.
>> In addition, parallel query can benefit significantly from this kind
>> of optimization.  Tom recently raised the case of an appendrel where
>> every child has a parallel-safe path but not every child has a partial
>> path; currently, we can't go parallel in that case, but it's easy to
>> see that we could handle it by scheduling the appendrel's children
>> across a pool of workers.  If we had this optimization, that sort of
>> thing would be much more likely to be useful, because it could create
>> appendrels where each member is an N-way join between equipartitioned
>> tables.  That's particularly important right now because of the
>> restriction that a partial path must be driven by a Parallel SeqScan,
>> but even after that restriction is lifted it's easy to imagine that
>> the effective degree of parallelism for a single index scan may be
>> limited - so this kind of thing may significantly increase the number
>> of workers that a given query can use productively.
> +1.
> The attached patch implements the logic to assess whether two partitioned
> tables can be joined using partition-wise join technique described in my
> last
> mail on this thread.
> Two partitioned relations are considered for partition-wise join if
> following
> conditions are met (See build_joinrel_part_info() for details):
> 1. Both the partitions have same number of partitions, with same number of
> partition keys and partitioned by same strategy - range or list.
> 2. They have matching datatypes for partition keys (partkey_types_match())
> 3. For list partitioned relations, they have same lists for each pair of
> partitions, paired by position in which they appear.
> 4. For range partitioned relations, they have same bounds for each pair of
> partitions, paired by their position when ordered in ascending fashion on
> the
> upper bounds.
> 5. There exists an equi-join condition for each pair of partition keys,
> paired
> by the position in which they appear.
> Partition-wise join technique can be applied under more lenient
> constraints [1]
> e.g. joins between tables with different number of partitions but having
> same
> bounds/lists for the common partitions. I am planning to defer that to a
> later
> version of this feature.
> A join executed using partition-wise join technique is itself a relation
> partitioned by the similar partitioning scheme as the joining relations
> with
> the partition keys combined from the joining relations.
> A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo)
> structure
> is used to store the partitioning information for a given base or relation.
> In build_simple_rel(), we construct PartitionOptInfo structure for the
> given
> base relation by copying the relation's PartitionDesc and PartitionKey
> (structures from Amit Langote's patch). While doing so, all the partition
> keys
> are stored as expressions. The structure also holds the RelOptInfos of the
> partition relations. For a join relation, most of the PartitionOptInfo is
> copied from either of the joining relations, except the partition keys and
> RelOptInfo of partition relations. Partition keys of the join relations are
> created by combing partition keys from both the joining relations. The
> logic to
> cosnstruct RelOptInfo for the partition-wise join relations is yet to be
> implemented.
> Since the logic to create the paths and RelOptInfos for partition-wise join
> relations is not implemented yet, a query which can use partition-wise join
> fails with error
> "ERROR: the relation was considered for partition-wise join, which is not
> supported right now.". It will also print messages to show which of the
> joins
> can and can not use partition-wise join technique e.g.
> "NOTICE:  join between relations (b 1) and (b 2) is considered for
> partition-wise join." The relations are indicated by their relid in the
> query.
> OR
> "NOTICE:  join between relations (b 1) and (b 2) is NOT considered for
> partition-wise join.".
> These messages are for debugging only, and will be removed once path
> creation
> logic is implemented.
> The patch adds a test partition_join.sql, which has a number of positive
> and
> negative testcases for joins between partitioned tables.
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Reply via email to