Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/21239 )
Change subject: IMPALA-13043: Implement Join Capability to the Calcite Planner ...................................................................... Patch Set 16: (4 comments) http://gerrit.cloudera.org:8080/#/c/21239/16//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21239/16//COMMIT_MSG@28 PS16, Line 28: There is some special case logic in the join that requires the execution : engine to run on a single node. The original logic can be found in the : JoinNode planner object, but this code isn't executed. The new mechanism : for checking single node executions is passed back to the root via the : NodeWithExprs object, and is checked in the ImpalaJoinRel.useSingleNode : method. > Ok...I had to reexamine this. I didn't code this part the first time and pu Impala's current planner will throw an error for some SQL statements that can't run in a distributed way. For example, a nested loop join with a FULL OUTER join will fail at that location in SingleNodePlanner.validatePlan(). Example SQL: select * from functional.alltypestiny a full outer join functional.alltypessmall b on a.id != b.id or a.int_col != b.int_col where a.id < 10; A user can get around it by setting num_nodes=1. It's a little bit weird, but that is how Impala works today. Maybe an error can be useful, because sometimes users forget to add a join condition and don't realize they are doing something that will be single-threaded. invertJoins() can convert nested loop join with RIGHT OUTER or RIGHT SEMI into LEFT OUTER / LEFT SEMI joins, which are compatible with being distributed. So, I don't think we want to fail before that runs. (invertJoins() will eventually be a Calcite rule, but for now it can handle this case.) I think we could drop the num_nodes=1 code and just let it fail at SingleNodePlanner.validatePlan(). I don't think that would impact any of the TPC queries. http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/functions/AnalyzedBinaryCompExpr.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/functions/AnalyzedBinaryCompExpr.java: http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/functions/AnalyzedBinaryCompExpr.java@65 PS16, Line 65: return 1; > Yeah, IIRC (and I may be misremembering) I think it crashed in PlanNode.ord Ok, makes sense http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/rel/node/ImpalaJoinRel.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/rel/node/ImpalaJoinRel.java: http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/rel/node/ImpalaJoinRel.java@258 PS16, Line 258: // Mapping it to a Left Semi Join in Impala seems to make : // sense since it is unclear when we would need a : // Right Semi Join : return JoinOperator.LEFT_SEMI_JOIN; > Yeah, we're probably going to have to make some Calcite changes to handle t Yeah, this is not a blocker for this change. Let's update this comment to say that Calcite SEMI is always a LEFT SEMI. http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/service/ExecRequestCreator.java File java/calcite-planner/src/main/java/org/apache/impala/calcite/service/ExecRequestCreator.java: http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/service/ExecRequestCreator.java@120 PS16, Line 120: if (useSingleNode) { : plannerContext.getQueryOptions().setNum_nodes(1); : } > My hope is that by the end of this project, there is no need to call "inver There are certain joins that we don't support for distributed plans. In particular, nested loop join does not support distributed RIGHT OUTER or RIGHT SEMI. So, the current planner prefers to flip those to LEFT OUTER and LEFT SEMI so it can use distributed execution. It doesn't do that based on cost right now, but there is an implicit calculation that a distributed plan is better than a single-node plan. The join inversion can save this from hitting the condition in SingleNodePlanner::validatePlan(). -- To view, visit http://gerrit.cloudera.org:8080/21239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5db097577907d79877f52feff2922000af074ecd Gerrit-Change-Number: 21239 Gerrit-PatchSet: 16 Gerrit-Owner: Steve Carlin <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Steve Carlin <[email protected]> Gerrit-Comment-Date: Tue, 03 Sep 2024 19:00:55 +0000 Gerrit-HasComments: Yes
