[ 
https://issues.apache.org/jira/browse/HIVE-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-20332:
-------------------------------------------
    Description: 
Currently, we do not expose stats over {{ROW\_\_ID.writeId}} to the optimizer 
(this should be fixed by HIVE-20313). Even if we did, we always assume a uniform 
distribution of the column values, which can easily lead to overestimating 
the number of rows read when we filter on {{ROW\_\_ID.writeId}} for 
materialized views (think of a large transaction for MV creation followed by 
small ones for incremental maintenance). This overestimation can prevent 
incremental view maintenance from being triggered, because the cost of the 
incremental plan is overestimated (we think we will read more rows than we 
actually do). This could be fixed by introducing histograms that better 
reflect the distribution of the column values.

Until both fixes are implemented, we will use a config variable that sets 
the selectivity of filter conditions on {{ROW\_\_ID}} during cost 
calculation. Setting that variable to a low value will favour incremental 
rebuild over full rebuild.

  was:
Currently, we do not expose stats over {{ROW\_\_ID.writeId}} to the optimizer. 
Even if we did, we always assume a uniform distribution of the column values, 
which can easily lead to overestimating the number of rows read when we 
filter on {{ROW\_\_ID.writeId}} for materialized views (think of a large 
transaction for MV creation followed by small ones for incremental 
maintenance). This overestimation can prevent incremental view maintenance 
from being triggered, because the cost of the incremental plan is 
overestimated (we think we will read more rows than we actually do). This 
could be fixed by introducing histograms that better reflect the 
distribution of the column values.

Until then, we will use a config variable that sets the selectivity of 
filter conditions on {{ROW\_\_ID}} during cost calculation. Setting that 
variable to a low value will favour incremental rebuild over full rebuild.


> Materialized views: Introduce heuristic on selectivity over ROW__ID to favour 
> incremental rebuild
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20332
>                 URL: https://issues.apache.org/jira/browse/HIVE-20332
>             Project: Hive
>          Issue Type: Improvement
>          Components: Materialized views
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>
> Currently, we do not expose stats over {{ROW\_\_ID.writeId}} to the optimizer 
> (this should be fixed by HIVE-20313). Even if we did, we always assume a 
> uniform distribution of the column values, which can easily lead to 
> overestimating the number of rows read when we filter on 
> {{ROW\_\_ID.writeId}} for materialized views (think of a large transaction 
> for MV creation followed by small ones for incremental maintenance). This 
> overestimation can prevent incremental view maintenance from being 
> triggered, because the cost of the incremental plan is overestimated (we 
> think we will read more rows than we actually do). This could be fixed by 
> introducing histograms that better reflect the distribution of the column 
> values.
> Until both fixes are implemented, we will use a config variable that sets 
> the selectivity of filter conditions on {{ROW\_\_ID}} during cost 
> calculation. Setting that variable to a low value will favour incremental 
> rebuild over full rebuild.
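
To make the overestimation in the description concrete, here is a minimal 
sketch in Python. It is not Hive code: the row counts, the write ID range, the 
function name and the constant ASSUMED_ROW_ID_SELECTIVITY are all hypothetical 
and chosen only for illustration. It compares the rows an incremental rebuild 
would actually read with the estimate obtained under a uniform distribution 
assumption and with a fixed, configured selectivity.

```python
def uniform_selectivity(min_id, max_id, threshold):
    """Estimated fraction of rows with write_id > threshold, assuming the
    values are uniformly distributed over [min_id, max_id]."""
    if max_id == min_id:
        return 1.0 if min_id > threshold else 0.0
    fraction = (max_id - threshold) / (max_id - min_id)
    return min(1.0, max(0.0, fraction))


# Hypothetical source table: one large transaction (write ID 1) loads the
# data used for MV creation, then nine small transactions (write IDs 2..10)
# add 100 rows each.
rows_per_write_id = {1: 1_000_000}
rows_per_write_id.update({write_id: 100 for write_id in range(2, 11)})
total_rows = sum(rows_per_write_id.values())

# Assume the MV already reflects everything up to write ID 8, so the
# incremental rebuild filters on write_id > 8.
threshold = 8

# Rows the incremental plan would really read.
actual_rows = sum(n for w, n in rows_per_write_id.items() if w > threshold)

# Estimate under the uniform assumption: (10 - 8) / (10 - 1) of all rows.
uniform_rows = uniform_selectivity(1, 10, threshold) * total_rows

# Estimate with a fixed selectivity, standing in for the config variable
# mentioned in the description (name and value are assumptions).
ASSUMED_ROW_ID_SELECTIVITY = 0.001
heuristic_rows = ASSUMED_ROW_ID_SELECTIVITY * total_rows

print(f"actual rows read:      {actual_rows}")
print(f"uniform estimate:      {uniform_rows:.0f}")
print(f"fixed-selectivity est: {heuristic_rows:.0f}")
```

With these made-up numbers the uniform estimate is more than a thousand times 
the number of rows actually read, while a low fixed selectivity lands much 
closer. A histogram over the write IDs would remove the need for the fixed 
value.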



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
