[
https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765512#comment-15765512
]
Prasanth Jayachandran commented on HIVE-15477:
----------------------------------------------
There are also places where worst case is assumed to be 1/3rd (inequality
operator). FilterStatsRule javadoc has more context on that.
The issue is two fold, without column statistics and histograms we are always
going to estimate wrongly, we also don't know how much will the row expand
in-memory when added to hashtable.
To control these we already have configs. hive.stats.join.factor and
hive.stats.deserialization.factor (this will accounted to data size computation
to say how much the data expands in-memory when added to hashtable). Tuning
these 2 should help with join OOM issues for map join. Since map join is
determined based on data size and we already have a config to tune it, I don't
think we need a config to control the number of rows. Thoughts?
> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
> Key: HIVE-15477
> URL: https://issues.apache.org/jira/browse/HIVE-15477
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 2.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Attachments: HIVE-15477.1.patch
>
>
> Currently when column stats are not available, Hive will assume the "worst"
> case by setting the # of output rows to be 1/2 of the # of input rows, for
> each predicate expression. This could be inaccurate, especially in the
> presence of multiple predicates chained by AND. We have found in some cases
> this could cause map join to have wrong ordering and thus fail with memory
> issue.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}})
> that can be used to control the percentage of rows emitted by a predicate
> expression.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)