[
https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765601#comment-15765601
]
Chao Sun commented on HIVE-15477:
---------------------------------
Thanks for the quick reply [~prasanth_j]! Yes, I'm fine if the existing configs
can work.
However, I think the issue is that *the compound effect of chained AND
predicates could be really dramatic*, and it could be pretty common since often
one has multiple filter conditions in complex data analysis.
For instance, imagine you have a big fact table *A* with many filters and a
small dimension table *B*. When stats are correct, one should put *B* on the
build side. However, due to the incorrect stats on *A*'s branch, the # of rows
in the "final" stats could be really small, and thus Hive will put *A* on the
build side and *B* on the probe side, which may then fail because the eventual
# of output rows from *A*'s branch could be much bigger than the stats
indicated.
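To make the compounding concrete, here is a rough sketch of the arithmetic.
It is not Hive's actual code: it assumes the 1/2-per-predicate rule from the
issue description, and the table sizes, the number of predicates, and the
"smaller estimate goes on the build side" rule are made-up simplifications.

```python
def estimated_rows(input_rows: int, num_predicates: int, factor: float = 0.5) -> int:
    """Estimate rows after AND-ed filters, each assumed to keep `factor` of its input."""
    rows = float(input_rows)
    for _ in range(num_predicates):
        rows *= factor  # every chained predicate shrinks the estimate again
    return int(rows)


# Hypothetical sizes: a 1B-row fact table A with 10 AND-ed filters,
# and a 1M-row dimension table B with no filters.
a_rows = 1_000_000_000
b_rows = 1_000_000

a_estimate = estimated_rows(a_rows, 10)  # 1e9 / 2**10, truncated

# Simplified planner rule (an assumption): the side with fewer estimated rows
# becomes the build side.
build_side = "A" if a_estimate < b_rows else "B"
print(a_estimate, build_side)
```

With these made-up numbers, the estimate for *A* drops just below the size of
*B*, so *A* is picked for the build side even if the filters actually keep most
of its rows.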
I'm not sure if any of the current configs can help with the above case (besides
the mapjoin hint, which is outdated). Also, I think the issue is not limited to
the mapjoin case, but applies to any query with filters. The
{{hive.stats.filter.predicate.factor}} config gives users an option to control
the degree of optimism in the filter estimate.
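As a rough illustration of what such a knob buys, the sketch below compares
estimates for a few factor values. How the patch actually applies the factor is
not shown in this thread, so the per-predicate multiplication, the function
name, and the numbers here are assumptions for illustration only.

```python
def estimate(input_rows: int, num_predicates: int, factor: float) -> int:
    """Closed form of the chained estimate: input_rows * factor ** num_predicates."""
    return int(input_rows * factor ** num_predicates)


# Hypothetical 1B-row table with 10 AND-ed predicates.
input_rows = 1_000_000_000
estimates = {f: estimate(input_rows, 10, f) for f in (0.5, 0.75, 1.0)}

for factor, rows in estimates.items():
    print(f"factor={factor}: ~{rows} rows")
```

A factor of 1.0 means the filters are assumed to remove nothing, which is the
most conservative setting for join-side decisions; smaller values are
progressively more optimistic.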
Any further thoughts?
> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
> Key: HIVE-15477
> URL: https://issues.apache.org/jira/browse/HIVE-15477
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 2.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Attachments: HIVE-15477.1.patch
>
>
> Currently when column stats are not available, Hive will assume the "worst"
> case by setting the # of output rows to be 1/2 of the # of input rows, for
> each predicate expression. This could be inaccurate, especially in the
> presence of multiple predicates chained by AND. We have found that in some
> cases this can cause a map join to pick the wrong ordering and thus fail with
> memory issues.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}})
> that can be used to control the percentage of rows emitted by a predicate
> expression.