Alex Behm has posted comments on this change. Change subject: IMPALA-3063: Separate join inversion from join ordering. ......................................................................
Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/3846/2//COMMIT_MSG Commit Message: Line 7: IMPALA-3063: Separate join inversion from join ordering. > the changes in test output make it look like some plans also change (so thi Correct. Let me explain why it is not only a separation of functionality. 1. Our join cardinality estimation is not symmetric, i.e., A JOIN B may not give the same estimate as B JOIN A due to our FK/PK detection heuristic. In the context of this patch this means that an inverted join may have a different cardinality estimate, so plans may change depending on whether the inversion is done during join ordering of after. 2. We currently only invert outer/semi/anti joins based on the rhs table ref join op. In this patch I want to preserve the existing behavior as much as possible, but when doing the join ordering in a separate pass we may see a join opn in a JoinNode that is different from the rhs table ref. So in some situations the inversion behavior based on the join op could be different and there are some examples in this patch. It's hard to say which method produces better plans, but the post-ordering join inversion seems simpler and easier to reason about. Maybe you have other thoughts? http://gerrit.cloudera.org:8080/#/c/3846/2/testdata/workloads/functional-planner/queries/PlannerTest/outer-joins.test File testdata/workloads/functional-planner/queries/PlannerTest/outer-joins.test: Line 814: | |--07:HASH JOIN [RIGHT OUTER JOIN] > why is this cross product flipped? Good catch. The cardinality estimates are clear: Join 08 should indeed be inverted. However, I'm trying to minimize the behavioral changes in this patch and I missed the case of non-equi inner joins which we should not invert if we want to preserve the existing behavior. http://gerrit.cloudera.org:8080/#/c/3846/2/testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test File testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test: Line 1088: | hash predicates: l_suppkey = s_suppkey, l_partkey = ps_partkey > we've seen meaningless changes like this before. i'm wondering whether we s We do order them based on cost, but we don't have a tie breaker to get an absolute order. I tried using toSql() as a tie breaker which is good enough for the explain use case (still not an absolute order). Unfortunately, that changes a ton of plans, so I'd prefer to do that in a follow-on patch. I believe we even have a JIRA for that. Sound good? Line 1777: | hash predicates: p_partkey = ps_partkey > is this the right thing to do, given selectivities? Another good catch. The change was only coincidentally caused by this patch. There was an independent bug in estimating cross join cardinalities which lead us to compute different cardinalities for: NESTED_LOOP_JOIN UNNEST SINGULAR_ROW_SRC versus. NESTED_LOOP_JOIN SINGULAR_ROW_SRC UNNEST That difference caused a plan change here because during join ordering we had a different version than before. -- To view, visit http://gerrit.cloudera.org:8080/3846 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If86db7753fc585bb4c69612745ec0103278888a4 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-HasComments: Yes
