[ 
https://issues.apache.org/jira/browse/HIVE-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572088#comment-16572088
 ] 

Jesus Camacho Rodriguez commented on HIVE-20332:
------------------------------------------------

[~ekoifman], agreed. In the longer term we will need HIVE-20313 plus 
information about the actual distribution of column values to make this a 
cost-based decision rather than a heuristic one.

> Materialized views: Introduce heuristic on selectivity over ROW__ID to favour 
> incremental rebuild
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20332
>                 URL: https://issues.apache.org/jira/browse/HIVE-20332
>             Project: Hive
>          Issue Type: Improvement
>          Components: Materialized views
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>
> Currently, we do not expose stats over {{ROW\_\_ID.writeId}} to the 
> optimizer. Even if we did, we always assume a uniform distribution of 
> column values, which can easily lead to overestimating the number of rows 
> read when we filter on {{ROW\_\_ID.writeId}} for materialized views (think 
> of one large transaction for MV creation followed by small ones for 
> incremental maintenance). This overestimation can prevent incremental view 
> maintenance from being triggered, because the cost of the incremental plan 
> is overestimated (we think we will read more rows than we actually do). 
> This could be fixed by introducing histograms that better reflect the 
> distribution of column values.
> Until then, we will use a config variable that sets the selectivity of 
> filter conditions on {{ROW\_\_ID}} during cost calculation. Setting that 
> variable to a low value will favour incremental rebuild over full rebuild.
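
To make the overestimation described above concrete, here is a minimal Python sketch. It is not Hive code: the function names, the `incremental_overhead` factor, the override value of 0.1, and the row counts are all assumptions chosen for illustration. It compares the selectivity of a `writeId > threshold` filter under a uniform-distribution assumption with the selectivity computed from actual per-transaction row counts, then shows how a fixed, low selectivity changes a simplified cost comparison.

```python
def uniform_selectivity(min_write_id, max_write_id, threshold):
    """Selectivity of `writeId > threshold` if rows were spread evenly over write ids."""
    span = max_write_id - min_write_id + 1
    matching_ids = max(0, max_write_id - max(threshold, min_write_id - 1))
    return matching_ids / span


def actual_selectivity(rows_per_write_id, threshold):
    """Selectivity of `writeId > threshold` computed from real per-write-id row counts."""
    total = sum(rows_per_write_id.values())
    matching = sum(n for wid, n in rows_per_write_id.items() if wid > threshold)
    return matching / total


def choose_rebuild(total_rows, row_id_selectivity, incremental_overhead=1.5):
    """Toy cost model (an assumption, not Hive's): pick the cheaper rebuild strategy."""
    full_cost = float(total_rows)
    incremental_cost = total_rows * row_id_selectivity * incremental_overhead
    return "incremental" if incremental_cost < full_cost else "full"


# One large transaction creates the data, four small ones follow (illustrative numbers).
rows_per_write_id = {1: 1_000_000, 2: 100, 3: 100, 4: 100, 5: 100}
total_rows = sum(rows_per_write_id.values())
last_seen_write_id = 1  # the MV already reflects write id 1

estimated = uniform_selectivity(1, 5, last_seen_write_id)
observed = actual_selectivity(rows_per_write_id, last_seen_write_id)

plan_with_uniform = choose_rebuild(total_rows, estimated)
plan_with_override = choose_rebuild(total_rows, 0.1)  # hypothetical low configured selectivity

print(f"uniform estimate: {estimated:.4f}, actual: {observed:.6f}")
print(f"plan with uniform estimate: {plan_with_uniform}")
print(f"plan with configured selectivity: {plan_with_override}")
```

Under these assumed numbers, the uniform estimate says most rows match the filter while only a tiny fraction actually do, so the toy cost model picks a full rebuild unless the selectivity is pinned to a low value.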



