Sorry forgot to mention: this patch applies on top of the v7 patches posted by Amit Langote on 27th June ( https://www.postgresql.org/message-id/81371428-bb4b-1e33-5ad6-8c5c51b52cb7%40lab.ntt.co.jp ).
On Tue, Jul 19, 2016 at 7:41 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > > > On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmh...@gmail.com> > wrote: > >> >> I haven't reviewed this code yet due to being busy with 9.6, but I >> think this is a very important query planner improvement with the >> potential for big wins on queries involving large amounts of data. >> >> Suppose we have a pair of equi-partitioned tables. Right now, if we >> choose to perform a hash join, we'll have to build a giant hash table >> with all of the rows from every inner partition and then probe it for >> every row in every outer partition. If there are few enough inner >> rows that the resultant hash table still fits in work_mem, this is >> somewhat inefficient but not terrible - but if it causes us to have to >> batch the hash join where we otherwise would not need to do so, then >> it really sucks. Similarly, if we decide to merge-join each pair of >> partitions, a partitionwise join may be able to use an internal sort >> on some or all partitions whereas if we had to deal with all of the >> data at the same time we'd need an external sort, possibly multi-pass. >> > > Or we might be able to use indexes directly without need of a MergeAppend. > > >> And if we choose a nested loop, say over an inner index-scan, we do >> O(outer rows) index probes with this optimization but O(outer rows * >> inner partitions) index probes without it. >> >> In addition, parallel query can benefit significantly from this kind >> of optimization. Tom recently raised the case of an appendrel where >> every child has a parallel-safe path but not every child has a partial >> path; currently, we can't go parallel in that case, but it's easy to >> see that we could handle it by scheduling the appendrel's children >> across a pool of workers. If we had this optimization, that sort of >> thing would be much more likely to be useful, because it could create >> appendrels where each member is an N-way join between equipartitioned >> tables. That's particularly important right now because of the >> restriction that a partial path must be driven by a Parallel SeqScan, >> but even after that restriction is lifted it's easy to imagine that >> the effective degree of parallelism for a single index scan may be >> limited - so this kind of thing may significantly increase the number >> of workers that a given query can use productively. >> > > +1. > > The attached patch implements the logic to assess whether two partitioned > tables can be joined using partition-wise join technique described in my > last > mail on this thread. > > Two partitioned relations are considered for partition-wise join if > following > conditions are met (See build_joinrel_part_info() for details): > 1. Both the partitions have same number of partitions, with same number of > partition keys and partitioned by same strategy - range or list. > 2. They have matching datatypes for partition keys (partkey_types_match()) > 3. For list partitioned relations, they have same lists for each pair of > partitions, paired by position in which they appear. > 4. For range partitioned relations, they have same bounds for each pair of > partitions, paired by their position when ordered in ascending fashion on > the > upper bounds. > 5. There exists an equi-join condition for each pair of partition keys, > paired > by the position in which they appear. > > Partition-wise join technique can be applied under more lenient > constraints [1] > e.g. joins between tables with different number of partitions but having > same > bounds/lists for the common partitions. I am planning to defer that to a > later > version of this feature. > > A join executed using partition-wise join technique is itself a relation > partitioned by the similar partitioning scheme as the joining relations > with > the partition keys combined from the joining relations. > > A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo) > structure > is used to store the partitioning information for a given base or relation. > In build_simple_rel(), we construct PartitionOptInfo structure for the > given > base relation by copying the relation's PartitionDesc and PartitionKey > (structures from Amit Langote's patch). While doing so, all the partition > keys > are stored as expressions. The structure also holds the RelOptInfos of the > partition relations. For a join relation, most of the PartitionOptInfo is > copied from either of the joining relations, except the partition keys and > RelOptInfo of partition relations. Partition keys of the join relations are > created by combing partition keys from both the joining relations. The > logic to > cosnstruct RelOptInfo for the partition-wise join relations is yet to be > implemented. > > Since the logic to create the paths and RelOptInfos for partition-wise join > relations is not implemented yet, a query which can use partition-wise join > fails with error > "ERROR: the relation was considered for partition-wise join, which is not > supported right now.". It will also print messages to show which of the > joins > can and can not use partition-wise join technique e.g. > "NOTICE: join between relations (b 1) and (b 2) is considered for > partition-wise join." The relations are indicated by their relid in the > query. > OR > "NOTICE: join between relations (b 1) and (b 2) is NOT considered for > partition-wise join.". > These messages are for debugging only, and will be removed once path > creation > logic is implemented. > > The patch adds a test partition_join.sql, which has a number of positive > and > negative testcases for joins between partitioned tables. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company