[
https://issues.apache.org/jira/browse/DRILL-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053758#comment-15053758
]
Aman Sinha commented on DRILL-4188:
-----------------------------------
Only if we can get the NDV (number of distinct values) stats for the columns,
right? That would take some time, whereas this issue (severe skew) seems to be
occurring in some user deployments.
> Change the default value of planner.enable_hash_single_key to false
> -------------------------------------------------------------------
>
> Key: DRILL-4188
> URL: https://issues.apache.org/jira/browse/DRILL-4188
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.4.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
>
> The planner.enable_hash_single_key flag is used by the HashJoin and MergeJoin
> plans to do hash distribution on both sides of the join when it is a
> multi-column join (e.g., T1.a1 = T2.a2 AND T1.b1 = T2.b2). The default value
> of this parameter is True, which means that Drill will generate multiple
> plans, each with hash distribution on only one column. The final plan is
> chosen based on costing.
> However, because column statistics are lacking, this approach is problematic:
> if all plans cost the same, we could end up picking the first column for hash
> distribution, and if this column has a low number of distinct values, there
> could be substantial skew in the distribution.
> Doing the hash distribution on all columns should be the default, so I
> propose to change planner.enable_hash_single_key to False. The scenario
> where we might still want single-column hash distribution is when the join is
> done after some other operation (e.g., a window function or grouped
> aggregation) where the child already does a hash distribution on one column
> that is part of the join. However, for those cases, we may want to
> selectively enable this flag.
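
The skew described above can be illustrated with a small simulation. This is
only a sketch, not Drill's planner or exchange code: the data, the receiver
count, the column names a1/b1 as dictionary keys, and the helper functions
bucket() and distribute() are all hypothetical, and MD5 stands in for whatever
hash function the engine actually uses. It compares how many rows each
receiving fragment gets when rows are hashed on one low-NDV column versus on
both join columns.

```python
import hashlib
from collections import Counter


def bucket(key, num_receivers):
    """Map a join key (a tuple of column values) to a receiver index.

    Uses a stable hash (MD5 of the key's repr) so results are repeatable;
    this is an assumption for illustration, not Drill's hash function.
    """
    digest = hashlib.md5(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_receivers


def distribute(rows, key_columns, num_receivers):
    """Return the number of rows sent to each receiver when hashing on key_columns."""
    counts = Counter()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        counts[bucket(key, num_receivers)] += 1
    return [counts.get(i, 0) for i in range(num_receivers)]


# Hypothetical table: a1 has only 2 distinct values, b1 has 10,000.
rows = [{"a1": i % 2, "b1": i} for i in range(10_000)]
NUM_RECEIVERS = 8

# Hash distribution on the single low-NDV column.
single_key_counts = distribute(rows, ["a1"], NUM_RECEIVERS)
# Hash distribution on all join columns.
all_keys_counts = distribute(rows, ["a1", "b1"], NUM_RECEIVERS)

print("single column (a1):", single_key_counts)
print("all columns (a1, b1):", all_keys_counts)
```

With only two distinct values in a1, at most two of the eight receivers can
get any rows under single-column distribution, while hashing on both columns
spreads the rows across all receivers.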