IMPALA-5120: Default to partitioned join when stats are missing Previously, we defaulted to broadcast join when stats were missing, but this can lead to disastrous plans when the right hand side is actually large.
Its always difficult to make good plans when stats are missing, but defaulting to partitioned joins should reduce the risk of disastrous plans. Testing: - Added a planner test that joins a table with no stats. Change-Id: Ie168ecfcd5e7c5d3c60d16926c151f8f134c81e0 Reviewed-on: http://gerrit.cloudera.org:8080/6803 Reviewed-by: Thomas Tauber-Marshall <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/aca07ee8 Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/aca07ee8 Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/aca07ee8 Branch: refs/heads/master Commit: aca07ee8160bbea0812dc4ba3c08dff818240d22 Parents: 374f112 Author: Thomas Tauber-Marshall <[email protected]> Authored: Thu May 4 13:51:08 2017 -0700 Committer: Impala Public Jenkins <[email protected]> Committed: Mon May 8 19:05:11 2017 +0000 ---------------------------------------------------------------------- .../impala/planner/DistributedPlanner.java | 3 ++- .../queries/PlannerTest/joins.test | 22 ++++++++++++++++++++ 2 files changed, 24 insertions(+), 1 deletion(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/aca07ee8/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java ---------------------------------------------------------------------- diff --git a/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java b/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java index e0e325c..83c8ccb 100644 --- a/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java +++ b/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java @@ -458,7 +458,8 @@ public class DistributedPlanner { // repartition: both left- and rightChildFragment are partitioned on the // join exprs, and a hash table is built with the rightChildFragment's output. PlanNode lhsTree = leftChildFragment.getPlanRoot(); - long partitionCost = Long.MAX_VALUE; + // Subtract 1 here so that if stats are missing we default to partitioned. + long partitionCost = Long.MAX_VALUE - 1; List<Expr> lhsJoinExprs = Lists.newArrayList(); List<Expr> rhsJoinExprs = Lists.newArrayList(); for (Expr joinConjunct: node.getEqJoinConjuncts()) { http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/aca07ee8/testdata/workloads/functional-planner/queries/PlannerTest/joins.test ---------------------------------------------------------------------- diff --git a/testdata/workloads/functional-planner/queries/PlannerTest/joins.test b/testdata/workloads/functional-planner/queries/PlannerTest/joins.test index 0fdb19d..26b8c64 100644 --- a/testdata/workloads/functional-planner/queries/PlannerTest/joins.test +++ b/testdata/workloads/functional-planner/queries/PlannerTest/joins.test @@ -2519,3 +2519,25 @@ PLAN-ROOT SINK 00:SCAN HDFS [tpch.customer a] partitions=1/1 files=1 size=23.08MB ==== +# If stats aren't available, default to partitioned join. +select * from functional.tinytable x, functional.tinytable y where x.a = y.a +---- DISTRIBUTEDPLAN +PLAN-ROOT SINK +| +05:EXCHANGE [UNPARTITIONED] +| +02:HASH JOIN [INNER JOIN, PARTITIONED] +| hash predicates: x.a = y.a +| runtime filters: RF000 <- y.a +| +|--04:EXCHANGE [HASH(y.a)] +| | +| 01:SCAN HDFS [functional.tinytable y] +| partitions=1/1 files=1 size=38B +| +03:EXCHANGE [HASH(x.a)] +| +00:SCAN HDFS [functional.tinytable x] + partitions=1/1 files=1 size=38B + runtime filters: RF000 -> x.a +====
