[
https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768176#comment-15768176
]
Chao Sun commented on HIVE-15477:
---------------------------------
[~prasanth_j] can you elaborate on what kind of misestimate the
"join_key_column IS NOT NULL" predicates can cause? I'm also curious why they
are added in Hive. I was looking at {{evaluateNotNullExpr}}, but it seems to
just return the input # of rows when column stats are not present? (I'm looking here:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L586)
Yeah, I totally agree that we are going to make wrong estimates even with
configs. It's very difficult to get 100% accurate stats. But with some configs
we can at least allow some manual intervention. :)
> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
> Key: HIVE-15477
> URL: https://issues.apache.org/jira/browse/HIVE-15477
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 2.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Attachments: HIVE-15477.1.patch
>
>
> Currently, when column stats are not available, Hive assumes the "worst"
> case by setting the # of output rows to 1/2 of the # of input rows for
> each predicate expression. This can be inaccurate, especially in the
> presence of multiple predicates chained by AND. We have found that in some
> cases this can cause map joins to be ordered incorrectly and thus fail with
> memory issues.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}})
> that can be used to control the percentage of rows emitted by a predicate
> expression.
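
To make the compounding effect concrete, here is a rough sketch of the arithmetic described above. It is not Hive's actual code: the class and method names are made up for illustration, and it assumes the per-predicate factor is simply applied once per predicate in an AND chain. The {{hive.stats.filter.factor}} name is only the suggestion from the description, not an existing setting.

```java
// Hypothetical sketch, not Hive's StatsRulesProcFactory logic.
public class FilterEstimateSketch {

    // Estimate output rows for a filter with numPredicates predicates chained
    // by AND, assuming each predicate keeps `factor` of its input rows.
    // factor = 0.5 models the "1/2 per predicate" default described above.
    static long estimateRows(long inputRows, int numPredicates, double factor) {
        if (inputRows < 0 || numPredicates < 0) {
            throw new IllegalArgumentException("inputRows and numPredicates must be >= 0");
        }
        if (factor <= 0.0 || factor > 1.0) {
            throw new IllegalArgumentException("factor must be in (0, 1]");
        }
        double estimate = inputRows;
        for (int i = 0; i < numPredicates; i++) {
            estimate *= factor; // each predicate shrinks the estimate again
        }
        // Never report fewer than 1 row for a non-empty input.
        return inputRows == 0 ? 0L : Math.max(1L, Math.round(estimate));
    }

    public static void main(String[] args) {
        long inputRows = 1_000_000L;
        // Three AND-ed predicates with the fixed 1/2 factor: 125000 rows.
        System.out.println(estimateRows(inputRows, 3, 0.5));
        // Same filter with a hypothetical configured factor of 0.9: 729000 rows.
        System.out.println(estimateRows(inputRows, 3, 0.9));
    }
}
```

With three predicates the fixed 1/2 factor already cuts the estimate to 1/8 of the input, which is how a large table can end up looking small enough to be picked as the map join side. A configurable factor would let users pull that estimate back up.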