[
https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765724#comment-15765724
]
Prasanth Jayachandran commented on HIVE-15477:
----------------------------------------------
Agreed that the compound effect of chained AND predicates can be bad when
there are absolutely no stats. But the primary issue is that when we don't
have stats, we operate on the file size and estimate the number of rows from
it (which can be way off because of compression). On top of that, we introduce
"join_key_column IS NOT NULL" predicates. This can dramatically mis-estimate
the data size, as we have no knowledge of the compression or any statistics.
That's the primary reason for adding the configs initially.
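To make the compression point concrete, here is a rough sketch of a row-count
guess derived from file size alone. It is not Hive's actual code: the function
name, the assumed 100-byte row width and the 10x compression ratio are all
hypothetical numbers picked for illustration.

```python
def estimate_rows_from_file_size(file_size_bytes: int, assumed_row_bytes: int) -> int:
    """Guess a row count from on-disk size alone (no table or column stats)."""
    return max(1, file_size_bytes // assumed_row_bytes)


# Hypothetical table: 10 million rows, 100 bytes per row when uncompressed,
# stored with an assumed 10x compression ratio.
true_rows = 10_000_000
row_bytes = 100
compression_ratio = 10
on_disk_bytes = true_rows * row_bytes // compression_ratio

# The estimator only sees the compressed size, so it undercounts the rows
# by the compression ratio.
estimated_rows = estimate_rows_from_file_size(on_disk_bytes, row_bytes)
error_factor = true_rows / estimated_rows

print(f"true rows:      {true_rows}")
print(f"estimated rows: {estimated_rows}")
print(f"off by a factor of {error_factor}")
```

Under these assumptions the starting row count is already off by the
compression ratio before any filter predicate is applied.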
There are two ends of the spectrum for the cardinality estimation: if the
column has an NDV of 1, then all rows are selected, and if NDV == numRows,
then numRows/2 will be estimated as selected, which is not bad. (When column
stats are present, the estimate is numRows/NDV, which is also not correct when
there is skew.) With chained AND predicates we filter out more rows by
definition, so the estimate will be worse without stats.
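A small sketch of the arithmetic being discussed, assuming the 1/2 factor is
applied once per predicate in an AND chain (as the issue description states)
and that numRows/NDV is the estimate when column stats exist. The function
names are made up for illustration, and the factor parameter only stands in
for the proposed hive.stats.filter.factor; none of this is Hive's actual
implementation.

```python
def rows_without_column_stats(num_rows: float, num_and_predicates: int,
                              factor: float = 0.5) -> float:
    """Apply a fixed selectivity factor once per predicate in an AND chain."""
    return num_rows * factor ** num_and_predicates


def rows_with_column_stats(num_rows: float, ndv: int) -> float:
    """Estimate for a single predicate when the column's NDV is known."""
    return num_rows / ndv


num_rows = 1_000_000

# Without column stats the estimate halves with every extra AND predicate.
for k in range(1, 5):
    print(f"{k} predicate(s): {rows_without_column_stats(num_rows, k):.0f} rows")

# With column stats the two ends of the NDV spectrum give very different
# answers, and the estimate ignores skew in the value distribution.
print(f"NDV = 1:       {rows_with_column_stats(num_rows, 1):.0f} rows")
print(f"NDV = numRows: {rows_with_column_stats(num_rows, num_rows):.0f} rows")
```

Three chained predicates already cut the estimate to one eighth of the input,
on top of whatever error the input row count carried.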
So my point is that even with configs we are going to make wrong estimations :)
In the worst case, I would rather keep "join_key_column IS NOT NULL" from
affecting the stats. Preferably, we want the optimizer to take care of such
scenarios rather than users. IMHO we should also consider ignoring the
"join_key_column IS NOT NULL" predicate that is introduced by Hive. I am not
sure which optimizer introduces this predicate. Maybe [~jcamachorodriguez] has
an idea?
I am not against adding another config, but it would be better if the
optimizer could make better decisions for worst-case scenarios with or without
the config.
> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
> Key: HIVE-15477
> URL: https://issues.apache.org/jira/browse/HIVE-15477
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 2.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Attachments: HIVE-15477.1.patch
>
>
> Currently when column stats are not available, Hive will assume the "worst"
> case by setting the # of output rows to be 1/2 of the # of input rows, for
> each predicate expression. This could be inaccurate, especially in the
> presence of multiple predicates chained by AND. We have found in some cases
> this could cause map join to have wrong ordering and thus fail with memory
> issue.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}})
> that can be used to control the percentage of rows emitted by a predicate
> expression.