Alex Behm has posted comments on this change.

Change subject: IMPALA-3063: Separate join inversion from join ordering.
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3846/2//COMMIT_MSG
Commit Message:

Line 7: IMPALA-3063: Separate join inversion from join ordering.
> the changes in test output make it look like some plans also change (so thi
Correct. Let me explain why it is not only a separation of functionality.

1. Our join cardinality estimation is not symmetric, i.e., A JOIN B may not 
give the same estimate as B JOIN A due to our FK/PK detection heuristic. In the 
context of this patch this means that an inverted join may have a different 
cardinality estimate, so plans may change depending on whether the inversion is 
done during join ordering of after.

2. We currently only invert outer/semi/anti joins based on the rhs table ref 
join op. In this patch I want to preserve the existing behavior as much as 
possible, but when doing the join ordering in a separate pass we may see a join 
opn in a JoinNode that is different from the rhs table ref. So in some 
situations the inversion behavior based on the join op could be different and 
there are some examples in this patch.


It's hard to say which method produces better plans, but the post-ordering join 
inversion seems simpler and easier to reason about. Maybe you have other 
thoughts?


http://gerrit.cloudera.org:8080/#/c/3846/2/testdata/workloads/functional-planner/queries/PlannerTest/outer-joins.test
File testdata/workloads/functional-planner/queries/PlannerTest/outer-joins.test:

Line 814: |  |--07:HASH JOIN [RIGHT OUTER JOIN]
> why is this cross product flipped?
Good catch. The cardinality estimates are clear: Join 08 should indeed be 
inverted.

However, I'm trying to minimize the behavioral changes in this patch and I 
missed the case of non-equi inner joins which we should not invert if we want 
to preserve the existing behavior.


http://gerrit.cloudera.org:8080/#/c/3846/2/testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
File testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test:

Line 1088: |  hash predicates: l_suppkey = s_suppkey, l_partkey = ps_partkey
> we've seen meaningless changes like this before. i'm wondering whether we s
We do order them based on cost, but we don't have a tie breaker to get an 
absolute order.

I tried using toSql() as a tie breaker which is good enough for the explain use 
case (still not an absolute order).
Unfortunately, that changes a ton of plans, so I'd prefer to do that in a 
follow-on patch. I believe we even have a JIRA for that. Sound good?


Line 1777: |  hash predicates: p_partkey = ps_partkey
> is this the right thing to do, given selectivities?
Another good catch. The change was only coincidentally caused by this patch. 
There was an independent bug in estimating cross join cardinalities which lead 
us to compute different cardinalities for:

NESTED_LOOP_JOIN
  UNNEST
SINGULAR_ROW_SRC 

versus.

NESTED_LOOP_JOIN
  SINGULAR_ROW_SRC
UNNEST

That difference caused a plan change here because during join ordering we had a 
different version than before.


-- 
To view, visit http://gerrit.cloudera.org:8080/3846
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If86db7753fc585bb4c69612745ec0103278888a4
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-HasComments: Yes

Reply via email to