[jira] [Commented] (HIVE-15477) Provide options to adjust filter stats when column stats are not available

Prasanth Jayachandran (JIRA) Tue, 20 Dec 2016 15:04:53 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765512#comment-15765512
 ]


Prasanth Jayachandran commented on HIVE-15477:
----------------------------------------------

There are also places where worst case is assumed to be 1/3rd (inequality 
operator). FilterStatsRule javadoc has more context on that. 
The issue is two fold, without column statistics and histograms we are always 
going to estimate wrongly, we also don't know how much will the row expand 
in-memory when added to hashtable.  
To control these we already have configs. hive.stats.join.factor and 
hive.stats.deserialization.factor (this will accounted to data size computation 
to say how much the data expands in-memory when added to hashtable). Tuning 
these 2 should help with join OOM issues for map join. Since map join is 
determined based on data size and we already have a config to tune it, I don't 
think we need a config to control the number of rows. Thoughts?


> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
>                 Key: HIVE-15477
>                 URL: https://issues.apache.org/jira/browse/HIVE-15477
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15477.1.patch
>
>
> Currently when column stats are not available, Hive will assume the "worst" 
> case by setting the # of output rows to be 1/2 of the # of input rows, for 
> each predicate expression. This could be inaccurate, especially in the 
> presence of multiple predicates chained by AND. We have found in some cases 
> this could cause map join to have wrong ordering and thus fail with memory 
> issue.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}}) 
> that can be used to control the percentage of rows emitted by a predicate 
> expression. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-15477) Provide options to adjust filter stats when column stats are not available

Reply via email to