On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmh...@gmail.com> wrote: > I decided to skip over 0001 for today and spend some time looking at > 0002-0006.
Back to 0001. + Enables or disables the query planner's use of partition-wise join + plans. When enabled, it spends time in creating paths for joins between + partitions and consumes memory to construct expression nodes to be used + for those joins, even if partition-wise join does not result in the + cheapest path. The time and memory increase exponentially with the + number of partitioned tables being joined and they increase linearly + with the number of partitions. The default is <literal>off</>. I think this is too scary and too much technical detail. I think you could just say something like: Enables or disables use of partition-wise join, which allows a join between partitioned tables to be performed by joining the matching partitions. Partition-wise join currently applies only when the join conditions include all the columns of the partition keys, which must be of the same data type and have exactly matching sets of child partitions. Because partition-wise join planning can use significantly increase CPU time and memory usage during planning, the default is <literal>off</>. +partitioned table. The join partners can not be found in other partitions. This +condition allows the join between partitioned tables to be broken into joins +between the matching partitions. The resultant join is partitioned in the same "The join partners can not be found in other partitions." is redundant with the previous sentence. I suggest deleting it. I also suggest "This condition allows the join between partitioned tables to be broken" -> "Because of this, the join between partitioned tables can be broken". +relation" for both partitioned table as well as join between partitioned tables +which can use partition-wise join technique. for either a partitioned table or a join between compatibly partitioned tables +Partitioning properties of a partitioned relation are stored in +PartitionSchemeData structure. Planner maintains a list of canonical partition +schemes (distinct PartitionSchemeData objects) so that any two partitioned +relations with same partitioning scheme share the same PartitionSchemeData +object. This reduces memory consumed by PartitionSchemeData objects and makes +it easy to compare the partition schemes of joining relations. Not all of the partitioning properties are stored in the PartitionSchemeData structure any more. I think this needs some rethinking and maybe some expansion. As written, each of the first two sentences needs a "the" at the beginning. + /* + * Create "append" paths for partitioned joins. Do this before + * creating GatherPaths so that partial "append" paths in + * partitioned joins will be considered. + */ I think you could shorten this to a single-line comment and just keep the first sentence. Similarly in the other location where you have the same sort of thing. + * child-joins. Otherwise, add_path might delete a path that some "append" + * path has reference to. to which some path generated here has a reference. Here and elsewhere, you use "append" rather than Append to refer to the paths added. I suppose that's weasel-wording to work around the fact that they might be either Append or MergeAppend paths, but I'm not sure it's really going to convey that to anyone. I suggest rephrasing those comments more generically, e.g.: + /* Add "append" paths containing paths from child-joins. */ You could say: Build additional paths for this rel from child-join paths. Or something. + if (!REL_HAS_ALL_PART_PROPS(rel)) + return; Isn't this an unnecessarily expensive test? I mean, it shouldn't be possible for it to have some arbitrary subset. + /* + * Every pair of joining relations we see here should have an equi-join + * between partition keys if this join has been deemed as a partitioned + * join. See build_joinrel_partition_info() for reasons. + */ + Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype, + parent_restrictlist)); I suggest removing this assertion. Seems like overkill to me. + child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo, + child_rel1->relids, + child_rel2->relids); It seems like we might end up doing this multiple times for the same child join, if there are more than 2 tables involved. Not sure if there's a good way to avoid that. Similarly for child_restrictlist. + pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks); Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack allocation is a lot faster, and then you don't need to pfree it. + /* Remove the relabel decoration. */ the -> any, decoration -> decorations + /* + * Replace the Var nodes of parent with those of children in expressions. + * This function may be called within a temporary context, but the + * expressions will be shallow-copied into the plan. Hence copy those in + * the planner's context. + */ I can't make heads or tails of this comment. --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -23,7 +23,9 @@ #include "optimizer/pathnode.h" #include "optimizer/paths.h" #include "optimizer/planmain.h" +#include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "optimizer/tlist.h" #include "optimizer/var.h" #include "parser/parsetree.h" #include "utils/lsyscache.h" Maybe not needed? This is the only hunk in this file? Or should this be part of one of the later patches? + Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel)); + + /* Ensure child relation is really what it claims to be. */ + Assert(IS_OTHER_REL(childrel)); I suggest tightening this up a bit by removing the comment and the blank line that precedes it. + foreach(lc, parentrel->reltarget->exprs) + { + PlaceHolderVar *phv = lfirst(lc); + + if (IsA(phv, PlaceHolderVar)) + { + /* + * In case the placeholder Var refers to any of the parent + * relations, translate it to refer to the corresponding child. + */ + if (bms_overlap(phv->phrels, parentrel->relids) && + childrel->reloptkind == RELOPT_OTHER_JOINREL) + { + phv = (PlaceHolderVar *) adjust_appendrel_attrs(root, + (Node *) phv, + nappinfos, + appinfos); + } + + childrel->reltarget->exprs = lappend(childrel->reltarget->exprs, + phv); + phv_added = true; + } + } What if the PHV is buried down inside the expression someplace rather than being at the top level? More generally, why are we not just applying adjust_appendrel_attrs() to the whole expression? + /* Adjust the cost and width of child targetlist. */ + if (phv_added) + { + childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup; + childrel->reltarget->cost.per_tuple = parentrel->reltarget->cost.per_tuple; + childrel->reltarget->width = parentrel->reltarget->width; + } Making this conditional on phv_added is probably not saving anything. Branches are expensive. /* * Otherwise, anything in a baserel or joinrel targetlist ought to be - * a Var. (More general cases can only appear in appendrel child - * rels, which will never be seen here.) + * a Var or ConvertRowtypeExpr. For either of those, find the original + * baserel where they originate. */ Hmm, but now we could potentially see an appendrel child rel here, so don't we need to worry about more general cases? If not, let's explain why not. + * if, it's a ConvertRowtypeExpr, it will be computed only for the American usage does not put a comma after if like this (unless you are writing writing if, for example, blah blah blah -- but there the commas are to surround for example, not due to the if itself). +/* + * build_joinrel_partition_info + * If the join between given partitioned relations is possibly partitioned + * set the partitioning scheme and partition keys expressions for the + * join. + * + * If the two relations have same partitioning scheme, their join may be + * partitioned and will follow the same partitioning scheme as the joining + * relations. + */ I think you could drop the primary comment block and use the secondary block as the primary one. That is, get rid of "If the join between..." and promote "If the two relations...". + * The join is not partitioned, if any of the relations being joined are Another comma that's not typical of American usage. + * For an N-way inner join, where every syntactic inner join has equi-join has -> has an + * For an N-way join with outer joins, where every syntactic join has an + * equi-join between partition keys and a matching partitioning scheme, + * outer join reordering identities in optimizer/README imply that only + * those pairs of join are legal which have an equi-join between partition + * keys. Thus every pair of joining relations we see for this join should + * have an equi-join between partition keys if this join has been deemed + * as a partitioned join. In line 2, partition keys -> the partition keys In line 3, outer join -> the outer join "pairs of join" sounds wrong too, although I'm not sure how to reword it. More broadly: I don't think I understand this comment. The statement about "those pairs of join are legal which have an equi-join between partition keys" doesn't match my understanding e.g. A IJ B ON A.x = B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such clause syntatically. Maybe you could replace this whole comment block with something like this: We can only consider this join as an input to further partition-wise joins if (a) the input relations are partitioned, (b) the partition schemes match, and (c) we can identify an equi-join between the partition keys. Note that if it were possible for have_partkey_equi_join to return different answers for the same joinrel depending on which join ordering we try first, this logic would break. That shouldn't happen, though, because of the way the query planner deduces implied equalities. + * Join relation is partitioned using same partitioning scheme as the + * joining relations and has same bounds. the same partitioning scheme + * An INNER join between two partitioned relations is partitioned by key + * expressions from both the relations. For tables A and B partitioned by + * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by + * both A.a and B.b. + * + * A SEMI/ANTI join only retains data from the outer side and is + * partitioned by the partition keys of the outer side. I would write: An INNER join between two partitioned relations can be regarded as partitioned by either key expression. For example, A INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on B.b; they are equivalent. For a SEMI or ANTI join, the result can only be regarded as being partitioned in the same manner as the outer side, since the inner columns are not retained. + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with + * B.b NULL. These rows may not fit the partitioning conditions imposed on + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. Good. + * Strictly speaking, partition keys of an OUTER join should include + * partition key expressions from the OUTER side only. Consider a join I would join this with the previous sentence instead of repeating strictly speaking: ...and thus the partition keys should include partition key expressions from the OUTER side only. After that sentence, I'd skip a lot of the intermediate words here and continue this way: However, because all commonly-used comparison operators are strict, the presence of nulls on the outer side doesn't cause any problem; they can't match anything at future join levels anyway. Therefore, we track two sets of expressions: those that authentically partition the relation (partexprs) and those that partition the relation with the exception that extra nulls may be present (nullable_partexprs). When the comparison operator is strict, the latter is just as good as the former. Then, I think you can omit the rest of what you have; it should be clear enough what's going on for the full and right cases given that explanation. + * being joined. partexprs and nullable_partexprs are arrays containing part_scheme->partnatts Long line, needs reflowing. I don't think this is too far from being committable. You've done some nice work here! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers