Re: d25ea01275 and partitionwise join
On Wed, Apr 8, 2020 at 11:17 AM Tom Lane wrote: > Amit Langote writes: > > On Tue, Apr 7, 2020 at 2:41 AM Tom Lane wrote: > >> I *think* that it might be all right, because although what we > >> are matching to is a user-written COALESCE() not an actual > >> FULL JOIN USING column, it has to behave in somewhat the same > >> way. In particular, by construction it must be a coalesce of > >> some representation of the matching partition columns of the > >> full join's inputs. So, even though it might go to null in > >> different cases than an actual USING variable would do, it > >> does not break the ability to partition the join. > > > Seems fine to me too. Maybe users should avoid writing it by hand if > > possible anyway, because even slight variation in the way it's written > > will affect this: > > I'm not particularly concerned about users intentionally trying to trigger > this behavior. I just want to be sure that if someone accidentally does > so, we don't produce a wrong plan. > > I waited till after the "advanced partitionwise join" patch went > in because that seemed more important (plus I wondered a bit if > that would subsume this). But this patch seems to still work, > and the other thing doesn't fix the problem, so pushed. Thank you for your time on this. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: d25ea01275 and partitionwise join
On Wed, Apr 8, 2020 at 11:17 AM Tom Lane wrote: > But this patch seems to still work, > and the other thing doesn't fix the problem, so pushed. Thanks for working on this! Best regards, Etsuro Fujita
Re: d25ea01275 and partitionwise join
Amit Langote writes: > On Tue, Apr 7, 2020 at 2:41 AM Tom Lane wrote: >> I *think* that it might be all right, because although what we >> are matching to is a user-written COALESCE() not an actual >> FULL JOIN USING column, it has to behave in somewhat the same >> way. In particular, by construction it must be a coalesce of >> some representation of the matching partition columns of the >> full join's inputs. So, even though it might go to null in >> different cases than an actual USING variable would do, it >> does not break the ability to partition the join. > Seems fine to me too. Maybe users should avoid writing it by hand if > possible anyway, because even slight variation in the way it's written > will affect this: I'm not particularly concerned about users intentionally trying to trigger this behavior. I just want to be sure that if someone accidentally does so, we don't produce a wrong plan. I waited till after the "advanced partitionwise join" patch went in because that seemed more important (plus I wondered a bit if that would subsume this). But this patch seems to still work, and the other thing doesn't fix the problem, so pushed. regards, tom lane
Re: d25ea01275 and partitionwise join
On Tue, Apr 7, 2020 at 2:41 AM Tom Lane wrote: > Amit Langote writes: > > which does succeed in using partitionwise join. Please see attached > > delta that applies on your v7 if that is what you'd rather have. > > I figured these queries were cheap enough that we could afford to run > both. With that and some revision of the comments (per attached), > I was feeling like we were ready to go. Looks good to me. > However, re-reading the thread, > one of Richard's comments struck me as still relevant. If you try, say, > > create table p (k int, val int) partition by range(k); > create table p_1 partition of p for values from (1) to (10); > create table p_2 partition of p for values from (10) to (100); > > set enable_partitionwise_join = 1; > > explain > select * from (p as t1 full join p as t2 on t1.k = t2.k) as > t12(k1,val1,k2,val2) > full join p as t3 on COALESCE(t12.k1, t12.k2) = t3.k; > > this patch will give you a partitioned join, with a different plan > than you get without enable_partitionwise_join. This is scary, > because it's not immediately obvious that the transformation is > correct. > > I *think* that it might be all right, because although what we > are matching to is a user-written COALESCE() not an actual > FULL JOIN USING column, it has to behave in somewhat the same > way. In particular, by construction it must be a coalesce of > some representation of the matching partition columns of the > full join's inputs. So, even though it might go to null in > different cases than an actual USING variable would do, it > does not break the ability to partition the join. Seems fine to me too. Maybe users should avoid writing it by hand if possible anyway, because even slight variation in the way it's written will affect this: set enable_partitionwise_join = 1; -- order of coalesce() arguments reversed explain (costs off) select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2) full join p as t3 on COALESCE(t12.k2, t12.k1) = t3.k; QUERY PLAN -- Hash Full Join Hash Cond: (COALESCE(t2.k, t1.k) = t3.k) -> Append -> Hash Full Join Hash Cond: (t1_1.k = t2_1.k) -> Seq Scan on p_1 t1_1 -> Hash -> Seq Scan on p_1 t2_1 -> Hash Full Join Hash Cond: (t1_2.k = t2_2.k) -> Seq Scan on p_2 t1_2 -> Hash -> Seq Scan on p_2 t2_2 -> Hash -> Append -> Seq Scan on p_1 t3_1 -> Seq Scan on p_2 t3_2 (17 rows) > However, I have not spent a whole lot of time thinking about > partitionwise joins, so rather than go ahead and commit I am > going to toss that point back out for community consideration. Agreed. > At the very least, what I'd written in the comment needs a > lot more defense than it has now. Sorry, which comment are you referring to? -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: d25ea01275 and partitionwise join
Amit Langote writes: > which does succeed in using partitionwise join. Please see attached > delta that applies on your v7 if that is what you'd rather have. I figured these queries were cheap enough that we could afford to run both. With that and some revision of the comments (per attached), I was feeling like we were ready to go. However, re-reading the thread, one of Richard's comments struck me as still relevant. If you try, say, create table p (k int, val int) partition by range(k); create table p_1 partition of p for values from (1) to (10); create table p_2 partition of p for values from (10) to (100); set enable_partitionwise_join = 1; explain select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2) full join p as t3 on COALESCE(t12.k1, t12.k2) = t3.k; this patch will give you a partitioned join, with a different plan than you get without enable_partitionwise_join. This is scary, because it's not immediately obvious that the transformation is correct. I *think* that it might be all right, because although what we are matching to is a user-written COALESCE() not an actual FULL JOIN USING column, it has to behave in somewhat the same way. In particular, by construction it must be a coalesce of some representation of the matching partition columns of the full join's inputs. So, even though it might go to null in different cases than an actual USING variable would do, it does not break the ability to partition the join. However, I have not spent a whole lot of time thinking about partitionwise joins, so rather than go ahead and commit I am going to toss that point back out for community consideration. At the very least, what I'd written in the comment needs a lot more defense than it has now. Thoughts? regards, tom lane diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index af1fb48..d190b4b 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -17,6 +17,7 @@ #include #include "miscadmin.h" +#include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" @@ -1890,7 +1891,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, RelOptInfo *outer_rel, RelOptInfo *inner_rel, JoinType jointype) { - int partnatts = joinrel->part_scheme->partnatts; + PartitionScheme part_scheme = joinrel->part_scheme; + int partnatts = part_scheme->partnatts; joinrel->partexprs = (List **) palloc0(sizeof(List *) * partnatts); joinrel->nullable_partexprs = @@ -1899,7 +1901,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, /* * The joinrel's partition expressions are the same as those of the input * rels, but we must properly classify them as nullable or not in the - * joinrel's output. + * joinrel's output. (Also, we add some more partition expressions if + * it's a FULL JOIN.) */ for (int cnt = 0; cnt < partnatts; cnt++) { @@ -1910,6 +1913,7 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, const List *inner_null_expr = inner_rel->nullable_partexprs[cnt]; List *partexpr = NIL; List *nullable_partexpr = NIL; + ListCell *lc; switch (jointype) { @@ -1969,6 +1973,38 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, outer_null_expr); nullable_partexpr = list_concat(nullable_partexpr, inner_null_expr); + +/* + * Also add CoalesceExprs corresponding to each possible + * full-join output variable (that is, left side coalesced to + * right side), so that we can match equijoin expressions + * using those variables. We really only need these for + * columns merged by JOIN USING, and only with the pairs of + * input items that correspond to the data structures that + * parse analysis would build for such variables. But it's + * hard to tell which those are, so just make all the pairs. + * Extra items in the nullable_partexprs list won't cause big + * problems. We assume no type coercions are needed to make + * the coalesce expressions, since columns of different types + * won't have gotten classified as the same PartitionScheme. + */ +foreach(lc, list_concat_copy(outer_expr, outer_null_expr)) +{ + Node *larg = (Node *) lfirst(lc); + ListCell *lc2; + + foreach(lc2, list_concat_copy(inner_expr, inner_null_expr)) + { + Node *rarg = (Node *) lfirst(lc2); + CoalesceExpr *c = makeNode(CoalesceExpr); + + c->coalescetype = exprType(larg); + c->coalescecollid = exprCollation(larg); + c->args = list_make2(larg, rarg); + c->location = -1; + nullable_partexpr = lappend(nullable_partexpr, c); + } +} break; default: diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h index 469c686..39c7b2f 100644 --- a/src/include/nodes/pathnodes.h +++ b/src/include/n
Re: d25ea01275 and partitionwise join
On Mon, Apr 6, 2020 at 11:09 PM Tom Lane wrote: > Amit Langote writes: > > Oops, I thought I copy-pasted 4-way full join test not this one, but > > evidently didn't. > > Have you got such a query at hand? I wondered whether we shouldn't > use a 4-way rather than 3-way test case; it'd offer more assurance > that nesting of these things works. Hmm, I just did: -SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) +SELECT COUNT(*) FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b) FULL JOIN prt2 p3(b,a,c) USING (a, b) FULL JOIN prt1 p4 (a,b,c) USING (a, b) which does succeed in using partitionwise join. Please see attached delta that applies on your v7 if that is what you'd rather have. -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com v7-test-delta.patch Description: Binary data
Re: d25ea01275 and partitionwise join
Amit Langote writes: > On Mon, Apr 6, 2020 at 7:29 AM Tom Lane wrote: >> * I think your logic for building the coalesce combinations was just >> wrong. We need combinations of left-hand inputs with right-hand inputs, >> not left-hand with left-hand or right-hand with right-hand. Also the >> nesting should already have happened in the inputs, we don't need to >> try to generate it here. The looping code was pretty messy as well. > It didn't occur to me that that many Coalesce combinations would be > necessary given the component rel combinations possible. Well, we don't of course: we only need the one pair that corresponds to the COALESCE structures the parser would have generated. But we aren't sure which one that is. I thought about looking through the full join RTE's joinaliasvars list for COALESCE items instead of doing it like this, but there is a problem: I don't believe that that data structure gets maintained after flatten_join_alias_vars(). So it might contain out-of-date expressions that don't match what we need them to match. Possibly there will be a cleaner answer here if I succeed in redesigning the Var data structure to account for outer joins better. >> * I did not follow the point of your second test case. The WHERE >> constraint on p1.a allows the planner to strength-reduce the joins, >> which is why there's no full join in that explain result, but then >> we aren't going to get to this code at all. > Oops, I thought I copy-pasted 4-way full join test not this one, but > evidently didn't. Have you got such a query at hand? I wondered whether we shouldn't use a 4-way rather than 3-way test case; it'd offer more assurance that nesting of these things works. regards, tom lane
Re: d25ea01275 and partitionwise join
On Mon, Apr 6, 2020 at 7:29 AM Tom Lane wrote: > Amit Langote writes: > > Okay, I tried that in the updated patch. I didn't try to come up with > > examples that might break it though. > > I looked through this. Thank you. > * I think your logic for building the coalesce combinations was just > wrong. We need combinations of left-hand inputs with right-hand inputs, > not left-hand with left-hand or right-hand with right-hand. Also the > nesting should already have happened in the inputs, we don't need to > try to generate it here. The looping code was pretty messy as well. It didn't occur to me that that many Coalesce combinations would be necessary given the component rel combinations possible. > * I don't think using partopcintype is necessarily right; that could be > a polymorphic type, for instance. Safer to copy the type of the input > expressions. Likely we could have got away with using partcollation, > but for consistency I copied that as well. Ah, seeing set_baserel_partition_key_exprs(), I suppose they will come from parttypid and parttypcoll of the base partitioned tables, which seems fine. > * You really need to update the data structure definitional comments > when you make a change like this. Sorry, I should have. > * I did not like putting a test case that requires > enable_partitionwise_aggregate in the partition_join test; that seems > misplaced. But it's not necessary to the test, is it? Earlier in the discussion (which turned into a separate discussion), there were test cases where partition-level grouping would fail with errors in setrefs.c, but I think that was fixed last year by 529ebb20aaa5. Agree that it has nothing to do with the problem being solved here. > * I did not follow the point of your second test case. The WHERE > constraint on p1.a allows the planner to strength-reduce the joins, > which is why there's no full join in that explain result, but then > we aren't going to get to this code at all. Oops, I thought I copy-pasted 4-way full join test not this one, but evidently didn't. > I repaired the above in the attached. > > I'm actually sort of pleasantly surprised that this worked; I was > not sure that building COALESCEs like this would provide the result > we needed. But it seems okay -- it does fix the behavior in this > 3-way test case, as well as the 4-way join you showed at the top > of the thread. It's fairly dependent on the fact that the planner > won't rearrange full joins; otherwise, the COALESCE structures we > build here might not match those made at parse time. But that's > not likely to change anytime soon; and this is hardly the only > place that would break, so I'm not sweating about it. (I have > some vague ideas about getting rid of the COALESCEs as part of > the Var redefinition I've been muttering about, anyway; there > might be a cleaner fix for this if that happens.) > > Anyway, I think this is probably OK for now. Given that the > nullable_partexprs lists are only used in one place, it's pretty > hard to see how it would break anything. Makes sense. -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: d25ea01275 and partitionwise join
Amit Langote writes: > Okay, I tried that in the updated patch. I didn't try to come up with > examples that might break it though. I looked through this. * Wasn't excited about inventing makeCoalesceExpr(); the fact that it only had two potential call sites seemed to make it not worth the trouble. Plus, as defined, it could not handle the general case of COALESCE, which can have N arguments; so that seemed like a recipe for confusion. * I think your logic for building the coalesce combinations was just wrong. We need combinations of left-hand inputs with right-hand inputs, not left-hand with left-hand or right-hand with right-hand. Also the nesting should already have happened in the inputs, we don't need to try to generate it here. The looping code was pretty messy as well. * I don't think using partopcintype is necessarily right; that could be a polymorphic type, for instance. Safer to copy the type of the input expressions. Likely we could have got away with using partcollation, but for consistency I copied that as well. * You really need to update the data structure definitional comments when you make a change like this. * I did not like putting a test case that requires enable_partitionwise_aggregate in the partition_join test; that seems misplaced. But it's not necessary to the test, is it? * I did not follow the point of your second test case. The WHERE constraint on p1.a allows the planner to strength-reduce the joins, which is why there's no full join in that explain result, but then we aren't going to get to this code at all. I repaired the above in the attached. I'm actually sort of pleasantly surprised that this worked; I was not sure that building COALESCEs like this would provide the result we needed. But it seems okay -- it does fix the behavior in this 3-way test case, as well as the 4-way join you showed at the top of the thread. It's fairly dependent on the fact that the planner won't rearrange full joins; otherwise, the COALESCE structures we build here might not match those made at parse time. But that's not likely to change anytime soon; and this is hardly the only place that would break, so I'm not sweating about it. (I have some vague ideas about getting rid of the COALESCEs as part of the Var redefinition I've been muttering about, anyway; there might be a cleaner fix for this if that happens.) Anyway, I think this is probably OK for now. Given that the nullable_partexprs lists are only used in one place, it's pretty hard to see how it would break anything. regards, tom lane diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index af1fb48..e1cc11c 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -17,6 +17,7 @@ #include #include "miscadmin.h" +#include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" @@ -1890,7 +1891,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, RelOptInfo *outer_rel, RelOptInfo *inner_rel, JoinType jointype) { - int partnatts = joinrel->part_scheme->partnatts; + PartitionScheme part_scheme = joinrel->part_scheme; + int partnatts = part_scheme->partnatts; joinrel->partexprs = (List **) palloc0(sizeof(List *) * partnatts); joinrel->nullable_partexprs = @@ -1899,7 +1901,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, /* * The joinrel's partition expressions are the same as those of the input * rels, but we must properly classify them as nullable or not in the - * joinrel's output. + * joinrel's output. (Also, we add some more partition expressions if + * it's a FULL JOIN.) */ for (int cnt = 0; cnt < partnatts; cnt++) { @@ -1910,6 +1913,7 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, const List *inner_null_expr = inner_rel->nullable_partexprs[cnt]; List *partexpr = NIL; List *nullable_partexpr = NIL; + ListCell *lc; switch (jointype) { @@ -1969,6 +1973,31 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel, outer_null_expr); nullable_partexpr = list_concat(nullable_partexpr, inner_null_expr); + +/* + * Also add CoalesceExprs corresponding to each possible + * full-join output variable (that is, left side coalesced to + * right side), so that we can match equijoin expressions + * using those variables. We assume no type coercions are + * needed to make the join outputs. + */ +foreach(lc, list_concat_copy(outer_expr, outer_null_expr)) +{ + Node *larg = (Node *) lfirst(lc); + ListCell *lc2; + + foreach(lc2, list_concat_copy(inner_expr, inner_null_expr)) + { + Node *rarg = (Node *) lfirst(lc2); + CoalesceExpr *c = makeNode(CoalesceExpr); + + c->coalescetype = exprType(larg); + c->coalescecollid = exprCollation(larg); + c->args = list_make2(larg, rar
Re: d25ea01275 and partitionwise join
On Sat, Apr 4, 2020 at 6:13 AM Tom Lane wrote: > Amit Langote writes: > > Updated patches attached. > > I looked through these and committed 0001+0002, with some further > comment-polishing. However, I have no faith at all in 0003. Thanks for the review. > It is > blithely digging through COALESCE expressions with no concern for > whether they came from full joins or not, or whether the other values > being coalesced to might completely change the semantics. Digging > through PlaceHolderVars scares me even more; what's that got to do > with the problem, anyway? So while this might fix the complained-of > issue of failing to use a partitionwise join, I think it wouldn't be > hard to create examples that it would incorrectly turn into > partitionwise joins. > > I wonder whether it'd be feasible to fix the problem by going in the > other direction; that is, while constructing the nullable_partexprs > lists for a full join, add synthesized COALESCE() expressions for the > output columns (by wrapping COALESCE around copies of the input rels' > partition expressions), and then not need to do anything special in > match_expr_to_partition_keys. We'd still need to convince ourselves > that this did the right thing and not any of the wrong things, but > I think it might be easier to prove it that way. Okay, I tried that in the updated patch. I didn't try to come up with examples that might break it though. -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com v6-0001-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patch Description: Binary data
Re: d25ea01275 and partitionwise join
Amit Langote writes: > Updated patches attached. I looked through these and committed 0001+0002, with some further comment-polishing. However, I have no faith at all in 0003. It is blithely digging through COALESCE expressions with no concern for whether they came from full joins or not, or whether the other values being coalesced to might completely change the semantics. Digging through PlaceHolderVars scares me even more; what's that got to do with the problem, anyway? So while this might fix the complained-of issue of failing to use a partitionwise join, I think it wouldn't be hard to create examples that it would incorrectly turn into partitionwise joins. I wonder whether it'd be feasible to fix the problem by going in the other direction; that is, while constructing the nullable_partexprs lists for a full join, add synthesized COALESCE() expressions for the output columns (by wrapping COALESCE around copies of the input rels' partition expressions), and then not need to do anything special in match_expr_to_partition_keys. We'd still need to convince ourselves that this did the right thing and not any of the wrong things, but I think it might be easier to prove it that way. regards, tom lane
Re: d25ea01275 and partitionwise join
On Sat, Feb 29, 2020 at 8:18 AM Tom Lane wrote: > Amit Langote writes: > > On Wed, Nov 6, 2019 at 2:00 AM Tom Lane wrote: > >> Just to leave a breadcrumb in this thread --- the planner failure > >> induced by d25ea01275 has been fixed in 529ebb20a. The difficulty > >> with multiway full joins that Amit started this thread with remains > >> open, but I imagine the posted patches will need rebasing over > >> 529ebb20a. > > > Here are the rebased patches. > > The cfbot shows these patches as failing regression tests. I think > it is just cosmetic fallout from 6ef77cf46 having changed EXPLAIN's > choices of table alias names; but please look closer to confirm, > and post updated patches. Thanks for notifying. Checked and indeed fallout from 6ef77cf46 seems to be the reason a test is failing. Updated patches attached. Thanks, Amit v5-0002-Move-some-code-from-joinrel.c-to-relnode.c.patch Description: Binary data v5-0001-Some-cosmetic-improvements-to-partitionwise-join-.patch Description: Binary data v5-0003-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patch Description: Binary data
Re: d25ea01275 and partitionwise join
Amit Langote writes: > On Wed, Nov 6, 2019 at 2:00 AM Tom Lane wrote: >> Just to leave a breadcrumb in this thread --- the planner failure >> induced by d25ea01275 has been fixed in 529ebb20a. The difficulty >> with multiway full joins that Amit started this thread with remains >> open, but I imagine the posted patches will need rebasing over >> 529ebb20a. > Here are the rebased patches. The cfbot shows these patches as failing regression tests. I think it is just cosmetic fallout from 6ef77cf46 having changed EXPLAIN's choices of table alias names; but please look closer to confirm, and post updated patches. regards, tom lane
Re: d25ea01275 and partitionwise join
On Wed, Nov 6, 2019 at 2:00 AM Tom Lane wrote: > > Amit Langote writes: > > On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby wrote: > >> I can't say much about the patch; there's a little typo: > >> "The nullability of inner relation keys prevents them to" > >> ..should say "prevent them from". > > > Thanks, will fix. > > Just to leave a breadcrumb in this thread --- the planner failure > induced by d25ea01275 has been fixed in 529ebb20a. The difficulty > with multiway full joins that Amit started this thread with remains > open, but I imagine the posted patches will need rebasing over > 529ebb20a. Here are the rebased patches. I've divided the patch containing only cosmetic improvements into two patches, of which the latter only moves around code. Patch 0003 implements the actual fix to the problem with multiway joins. Thanks, Amit 0001-Some-cosmetic-improvements-to-partitionwise-join-cod.patch Description: Binary data 0002-Move-some-code-from-joinrel.c-to-relnode.c.patch Description: Binary data 0003-Fix-partitionwise-join-to-handle-FULL-JOINs-correctl.patch Description: Binary data
Re: d25ea01275 and partitionwise join
Amit Langote writes: > On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby wrote: >> I can't say much about the patch; there's a little typo: >> "The nullability of inner relation keys prevents them to" >> ..should say "prevent them from". > Thanks, will fix. Just to leave a breadcrumb in this thread --- the planner failure induced by d25ea01275 has been fixed in 529ebb20a. The difficulty with multiway full joins that Amit started this thread with remains open, but I imagine the posted patches will need rebasing over 529ebb20a. regards, tom lane
Re: d25ea01275 and partitionwise join
Hi Justin, On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby wrote: > > On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote: > > Please find attached updated patches. > > Tom pointed me to this thread, since we hit it in 12.0 > https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d > > I can't say much about the patch; there's a little typo: > "The nullability of inner relation keys prevents them to" > ..should say "prevent them from". Thanks, will fix. Regards, Amit
Re: d25ea01275 and partitionwise join
On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote: > Please find attached updated patches. Tom pointed me to this thread, since we hit it in 12.0 https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d I can't say much about the patch; there's a little typo: "The nullability of inner relation keys prevents them to" ..should say "prevent them from". In order to compile it against REL12, I tried to cherry-pick this one: 3373c715: Speed up finding EquivalenceClasses for a given set of rels But then it crashes in check-world (possibly due to misapplied hunks). -- Justin Pryzby System Administrator Telsasoft +1-952-707-8581
Re: d25ea01275 and partitionwise join
On Sun, Oct 13, 2019 at 03:02:17PM -0500, Justin Pryzby wrote: > On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote: > > Please find attached updated patches. > > Tom pointed me to this thread, since we hit it in 12.0 > https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d > > I can't say much about the patch; there's a little typo: > "The nullability of inner relation keys prevents them to" > ..should say "prevent them from". > > In order to compile it against REL12, I tried to cherry-pick this one: > 3373c715: Speed up finding EquivalenceClasses for a given set of rels > > But then it crashes in check-world (possibly due to misapplied hunks). I did it again paying more attention and got it to pass. The PWJ + FULL JOIN query seems ok now. But I'll leave PWJ disabled until I can look more closely. $ PGOPTIONS='-c max_parallel_workers_per_gather=0 -c enable_mergejoin=off -c enable_hashagg=off -c enable_partitionwise_join=on' psql postgres -f tmp/sql-2019-10-11.1 SET Nested Loop (cost=80106964.13..131163200.28 rows=2226681567 width=6) Join Filter: ((s.site_location = ''::text) OR ((s.site_office)::integer = ((COALESCE(t1.site_id, t2.site_id))::integer))) -> Group (cost=80106964.13..80837945.46 rows=22491733 width=12) Group Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Merge Append (cost=80106964.13..80613028.13 rows=22491733 width=12) Sort Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Group (cost=25494496.54..25633699.28 rows=11136219 width=12) Group Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Sort (cost=25494496.54..25522337.09 rows=11136219 width=12) Sort Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer) -> Hash Full Join (cost=28608.75..24191071.36 rows=11136219 width=12) Hash Cond: ((t1.start_time = t2.start_time) AND (t1.site_id = t2.site_id)) Filter: ((COALESCE(t1.start_time, t2.start_time) >= '2019-10-01 00:00:00'::timestamp without time zone) AND (COALESCE(t1.start_time, t2.start_time) < '2019-10-01 01:00:00'::timestamp without time zone)) -> Seq Scan on t1 (cost=0.00..14495.10 rows=940910 width=10) -> Hash (cost=14495.10..14495.10 rows=940910 width=10) -> Seq Scan on t1 t2 (cost=0.00..14495.10 rows=940910 width=10) -> Group (cost=54612467.58..54754411.51 rows=11355514 width=12) Group Key: (COALESCE(t1_1.start_time, t2_1.start_time)), ((COALESCE(t1_1.site_id, t2_1.site_id))::integer) -> Sort (cost=54612467.58..54640856.37 rows=11355514 width=12) Sort Key: (COALESCE(t1_1.start_time, t2_1.start_time)), ((COALESCE(t1_1.site_id, t2_1.site_id))::integer) -> Hash Full Join (cost=28608.75..53281777.94 rows=11355514 width=12) Hash Cond: ((t1_1.start_time = t2_1.start_time) AND (t1_1.site_id = t2_1.site_id)) Filter: ((COALESCE(t1_1.start_time, t2_1.start_time) >= '2019-10-01 00:00:00'::timestamp without time zone) AND (COALESCE(t1_1.start_time, t2_1.start_time) < '2019-10-01 01:00:00'::timestamp without time zone)) -> Seq Scan on t2 t1_1 (cost=0.00..14495.10 rows=940910 width=10) -> Hash (cost=14495.10..14495.10 rows=940910 width=10) -> Seq Scan on t2 t2_1 (cost=0.00..14495.10 rows=940910 width=10) -> Materialize (cost=0.00..2.48 rows=99 width=6) -> Seq Scan on s (cost=0.00..1.99 rows=99 width=6) -- Justin Pryzby System Administrator Telsasoft +1-952-707-8581
Re: d25ea01275 and partitionwise join
On Thu, Sep 19, 2019 at 4:15 PM Amit Langote wrote: > Hi Richard, > > Thanks a lot for taking a close look at the patch and sorry about the > delay. > > On Wed, Sep 4, 2019 at 5:29 PM Richard Guo wrote: > >> On Wed, Sep 4, 2019 at 10:01 AM Amit Langote > wrote: > > I'm reviewing v2-0002 and I have concern about how COALESCE expr is > > processed in match_join_arg_to_partition_keys(). > > > > If there is a COALESCE expr with first arg being non-partition key expr > > and second arg being partition key, the patch would match it to the > > partition key, which may result in wrong results in some cases. > > > > For instance, consider the partition table below: > > > > create table p (k int, val int) partition by range(k); > > create table p_1 partition of p for values from (1) to (10); > > create table p_2 partition of p for values from (10) to (100); > > > > So with patch v2-0002, the following query will be planned with > > partitionwise join. > > > > # explain (costs off) > > select * from (p as t1 full join p as t2 on t1.k = t2.k) as > t12(k1,val1,k2,val2) > > full join p as t3 on COALESCE(t12.val1, > t12.k1) = t3.k; > > QUERY PLAN > > -- > > Append > >-> Hash Full Join > > Hash Cond: (COALESCE(t1.val, t1.k) = t3.k) > > -> Hash Full Join > >Hash Cond: (t1.k = t2.k) > >-> Seq Scan on p_1 t1 > >-> Hash > > -> Seq Scan on p_1 t2 > > -> Hash > >-> Seq Scan on p_1 t3 > >-> Hash Full Join > > Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k) > > -> Hash Full Join > >Hash Cond: (t1_1.k = t2_1.k) > >-> Seq Scan on p_2 t1_1 > >-> Hash > > -> Seq Scan on p_2 t2_1 > > -> Hash > >-> Seq Scan on p_2 t3_1 > > (19 rows) > > > > But as t1.val is not a partition key, actually we cannot use > > partitionwise join here. > > > > If we insert below data into the table, we will get wrong results for > > the query above. > > > > insert into p select 5,15; > > insert into p select 15,5; > > Good catch! It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k > for partitionwise join as the COALESCE expression might as well output > the value of val1 which doesn't conform to partitioning. > > I've fixed match_join_arg_to_partition_keys() to catch that case and > fail. Added a test case as well. > > Please find attached updated patches. > Thank you for the fix. Will review. Thanks Richard
Re: d25ea01275 and partitionwise join
Hi Richard, Thanks a lot for taking a close look at the patch and sorry about the delay. On Wed, Sep 4, 2019 at 5:29 PM Richard Guo wrote: >> On Wed, Sep 4, 2019 at 10:01 AM Amit Langote wrote: > I'm reviewing v2-0002 and I have concern about how COALESCE expr is > processed in match_join_arg_to_partition_keys(). > > If there is a COALESCE expr with first arg being non-partition key expr > and second arg being partition key, the patch would match it to the > partition key, which may result in wrong results in some cases. > > For instance, consider the partition table below: > > create table p (k int, val int) partition by range(k); > create table p_1 partition of p for values from (1) to (10); > create table p_2 partition of p for values from (10) to (100); > > So with patch v2-0002, the following query will be planned with > partitionwise join. > > # explain (costs off) > select * from (p as t1 full join p as t2 on t1.k = t2.k) as > t12(k1,val1,k2,val2) > full join p as t3 on COALESCE(t12.val1, t12.k1) = > t3.k; > QUERY PLAN > -- > Append >-> Hash Full Join > Hash Cond: (COALESCE(t1.val, t1.k) = t3.k) > -> Hash Full Join >Hash Cond: (t1.k = t2.k) >-> Seq Scan on p_1 t1 >-> Hash > -> Seq Scan on p_1 t2 > -> Hash >-> Seq Scan on p_1 t3 >-> Hash Full Join > Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k) > -> Hash Full Join >Hash Cond: (t1_1.k = t2_1.k) >-> Seq Scan on p_2 t1_1 >-> Hash > -> Seq Scan on p_2 t2_1 > -> Hash >-> Seq Scan on p_2 t3_1 > (19 rows) > > But as t1.val is not a partition key, actually we cannot use > partitionwise join here. > > If we insert below data into the table, we will get wrong results for > the query above. > > insert into p select 5,15; > insert into p select 15,5; Good catch! It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k for partitionwise join as the COALESCE expression might as well output the value of val1 which doesn't conform to partitioning. I've fixed match_join_arg_to_partition_keys() to catch that case and fail. Added a test case as well. Please find attached updated patches. Thanks, Amit v3-0003-Add-multi-relation-EC-child-members-in-a-separate.patch Description: Binary data v3-0001-Some-cosmetic-improvements-to-partitionwise-join-.patch Description: Binary data v3-0002-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patch Description: Binary data
Re: d25ea01275 and partitionwise join
Hello Richard, On Wed, Sep 4, 2019 at 4:30 PM Richard Guo wrote: > > Hi Amit, > > On Wed, Sep 4, 2019 at 10:01 AM Amit Langote wrote: >> >> Fujita-san, >> >> To avoid losing track of this, I've added this to November CF. >> >> https://commitfest.postgresql.org/25/2278/ >> >> I know there is one more patch beside the partitionwise join fix, but >> I've set the title to suggest that this is related mainly to >> partitionwise joins. > > > Thank you for working on this. Currently partitionwise join does not > take COALESCE expr into consideration when matching to partition keys. > This is a problem. > > BTW, a rebase is needed for the patch set. Thanks a lot for looking at this. I tried rebasing today and found that adopting this patch to the following recent commit to equivalence processing code would take some time that I don't currently have. commit 3373c7155350cf6fcd51dd090f29e1332901e329 Author: David Rowley Date: Sun Jul 21 17:30:58 2019 +1200 Speed up finding EquivalenceClasses for a given set of rels I will come back to this in a couple of weeks, along with addressing your other comments. Thanks, Amit
Re: d25ea01275 and partitionwise join
Hi Amit, On Wed, Sep 4, 2019 at 3:30 PM Richard Guo wrote: > Hi Amit, > > On Wed, Sep 4, 2019 at 10:01 AM Amit Langote > wrote: > >> Fujita-san, >> >> To avoid losing track of this, I've added this to November CF. >> >> https://commitfest.postgresql.org/25/2278/ >> >> I know there is one more patch beside the partitionwise join fix, but >> I've set the title to suggest that this is related mainly to >> partitionwise joins. >> > > Thank you for working on this. Currently partitionwise join does not > take COALESCE expr into consideration when matching to partition keys. > This is a problem. > > BTW, a rebase is needed for the patch set. > I'm reviewing v2-0002 and I have concern about how COALESCE expr is processed in match_join_arg_to_partition_keys(). If there is a COALESCE expr with first arg being non-partition key expr and second arg being partition key, the patch would match it to the partition key, which may result in wrong results in some cases. For instance, consider the partition table below: create table p (k int, val int) partition by range(k); create table p_1 partition of p for values from (1) to (10); create table p_2 partition of p for values from (10) to (100); So with patch v2-0002, the following query will be planned with partitionwise join. # explain (costs off) select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2) full join p as t3 on COALESCE(t12.val1, t12.k1) = t3.k; QUERY PLAN -- Append -> Hash Full Join Hash Cond: (COALESCE(t1.val, t1.k) = t3.k) -> Hash Full Join Hash Cond: (t1.k = t2.k) -> Seq Scan on p_1 t1 -> Hash -> Seq Scan on p_1 t2 -> Hash -> Seq Scan on p_1 t3 -> Hash Full Join Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k) -> Hash Full Join Hash Cond: (t1_1.k = t2_1.k) -> Seq Scan on p_2 t1_1 -> Hash -> Seq Scan on p_2 t2_1 -> Hash -> Seq Scan on p_2 t3_1 (19 rows) But as t1.val is not a partition key, actually we cannot use partitionwise join here. If we insert below data into the table, we will get wrong results for the query above. insert into p select 5,15; insert into p select 15,5; Thanks Richard
Re: d25ea01275 and partitionwise join
Hi Amit, On Wed, Sep 4, 2019 at 10:01 AM Amit Langote wrote: > Fujita-san, > > To avoid losing track of this, I've added this to November CF. > > https://commitfest.postgresql.org/25/2278/ > > I know there is one more patch beside the partitionwise join fix, but > I've set the title to suggest that this is related mainly to > partitionwise joins. > Thank you for working on this. Currently partitionwise join does not take COALESCE expr into consideration when matching to partition keys. This is a problem. BTW, a rebase is needed for the patch set. Thanks Richard
Re: d25ea01275 and partitionwise join
Fujita-san, On Thu, Jul 18, 2019 at 8:10 PM Etsuro Fujita wrote: > > On Thu, Jul 18, 2019 at 11:18 AM Amit Langote wrote: > > On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita > > wrote: > > > On Tue, Jul 2, 2019 at 6:29 PM Amit Langote > > > wrote: > > > > 0001 - fix partitionwise join to work correctly with n-way joins of > > > > which some are full joins (+ cosmetic improvements around the code > > > > that was touched) > > > > > > Here are my comments about the cosmetic improvements: they seem pretty > > > large to me, so I'd make a separate patch for that. > > > > OK, my bad that I added so many cosmetic changes into a patch that is > > meant to fix the main issue. Just to clarify, I'm proposing these > > cosmetic improvements to better clarify the terminological separation > > between nullable and non-nullable partition keys, which I found a bit > > hard to understand as is. > > OK, thanks for the explanation! > > > I've broken the patch into two: 0001 contains only cosmetic changes > > and 0002 the fix for handling full joins properly. Would you rather > > that be reversed? > > I like this order. > > > > In addition, I'd > > > move have_partkey_equi_join() and match_expr_to_partition_keys() to > > > relnode.c, because these functions are only used in that file. > > > > I hadn't noticed that. Makes sense to move them to relnode.c, which > > is implemented in 0001. > > Thanks for including that! Will review. To avoid losing track of this, I've added this to November CF. https://commitfest.postgresql.org/25/2278/ I know there is one more patch beside the partitionwise join fix, but I've set the title to suggest that this is related mainly to partitionwise joins. Thanks, Amit
Re: d25ea01275 and partitionwise join
On Thu, Jul 18, 2019 at 11:18 AM Amit Langote wrote: > On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita wrote: > > On Tue, Jul 2, 2019 at 6:29 PM Amit Langote wrote: > > > 0001 - fix partitionwise join to work correctly with n-way joins of > > > which some are full joins (+ cosmetic improvements around the code > > > that was touched) > > > > Here are my comments about the cosmetic improvements: they seem pretty > > large to me, so I'd make a separate patch for that. > > OK, my bad that I added so many cosmetic changes into a patch that is > meant to fix the main issue. Just to clarify, I'm proposing these > cosmetic improvements to better clarify the terminological separation > between nullable and non-nullable partition keys, which I found a bit > hard to understand as is. OK, thanks for the explanation! > I've broken the patch into two: 0001 contains only cosmetic changes > and 0002 the fix for handling full joins properly. Would you rather > that be reversed? I like this order. > > In addition, I'd > > move have_partkey_equi_join() and match_expr_to_partition_keys() to > > relnode.c, because these functions are only used in that file. > > I hadn't noticed that. Makes sense to move them to relnode.c, which > is implemented in 0001. Thanks for including that! Will review. Best regards, Etsuro Fujita
Re: d25ea01275 and partitionwise join
Fujita-san, Thanks for looking at this. On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita wrote: > > On Tue, Jul 2, 2019 at 6:29 PM Amit Langote wrote: > > 0001 - fix partitionwise join to work correctly with n-way joins of > > which some are full joins (+ cosmetic improvements around the code > > that was touched) > > Here are my comments about the cosmetic improvements: they seem pretty > large to me, so I'd make a separate patch for that. OK, my bad that I added so many cosmetic changes into a patch that is meant to fix the main issue. Just to clarify, I'm proposing these cosmetic improvements to better clarify the terminological separation between nullable and non-nullable partition keys, which I found a bit hard to understand as is. I've broken the patch into two: 0001 contains only cosmetic changes and 0002 the fix for handling full joins properly. Would you rather that be reversed? > In addition, I'd > move have_partkey_equi_join() and match_expr_to_partition_keys() to > relnode.c, because these functions are only used in that file. I hadn't noticed that. Makes sense to move them to relnode.c, which is implemented in 0001. Thanks, Amit v2-0001-Some-cosmetic-improvements-to-partitionwise-join-.patch Description: Binary data v2-0002-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patch Description: Binary data v2-0003-Add-multi-relation-EC-child-members-in-a-separate.patch Description: Binary data
Re: d25ea01275 and partitionwise join
On Tue, Jul 2, 2019 at 6:29 PM Amit Langote wrote: > 0001 - fix partitionwise join to work correctly with n-way joins of > which some are full joins (+ cosmetic improvements around the code > that was touched) Here are my comments about the cosmetic improvements: they seem pretty large to me, so I'd make a separate patch for that. In addition, I'd move have_partkey_equi_join() and match_expr_to_partition_keys() to relnode.c, because these functions are only used in that file. Best regards, Etsuro Fujita