[ https://issues.apache.org/jira/browse/HIVE-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572088#comment-16572088 ]
Jesus Camacho Rodriguez commented on HIVE-20332:
------------------------------------------------

[~ekoifman], agree. In the longer term, HIVE-20313 plus information about the actual distribution of column values will be needed to make this a cost-based decision instead of a heuristic one.

> Materialized views: Introduce heuristic on selectivity over ROW__ID to favour incremental rebuild
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20332
>                 URL: https://issues.apache.org/jira/browse/HIVE-20332
>             Project: Hive
>          Issue Type: Improvement
>          Components: Materialized views
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>
> Currently, we do not expose stats over {{ROW\_\_ID.writeId}} to the
> optimizer. Even if we did, we always assume a uniform distribution of the
> column values, which can easily lead to overestimating the number of rows
> read when we filter on {{ROW\_\_ID.writeId}} for materialized views (think
> of one large transaction for MV creation followed by small ones for
> incremental maintenance). Because the cost of the incremental plan is then
> overestimated (we think we will read more rows than we actually do),
> incremental view maintenance may not be triggered. This could be fixed by
> introducing histograms that better reflect the distribution of column values.
> Until then, we will use a config variable that sets the selectivity of
> filter conditions on {{ROW\_\_ID}} during cost calculation. Setting
> that variable to a low value will favour incremental rebuild over full
> rebuild.
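
The overestimation described in the issue can be illustrated with a small, self-contained sketch. It is not Hive code: the function names, the table contents and the `override_selectivity` value are all hypothetical, chosen only to show how a uniform-distribution assumption over writeId inflates the row estimate compared with the real data, and how a fixed low selectivity brings the estimate back down.

```python
def uniform_selectivity(min_write_id, max_write_id, threshold):
    """Selectivity of `writeId > threshold` if every writeId held equally many rows."""
    span = max_write_id - min_write_id + 1
    matching_ids = max(0, max_write_id - max(threshold, min_write_id - 1))
    return matching_ids / span


def actual_selectivity(rows_per_write_id, threshold):
    """Selectivity of `writeId > threshold` computed from the real row counts."""
    total = sum(rows_per_write_id.values())
    matching = sum(n for wid, n in rows_per_write_id.items() if wid > threshold)
    return matching / total


def estimated_rows(total_rows, selectivity):
    """Rows the optimizer would expect the filter to return."""
    return total_rows * selectivity


# Hypothetical table: one large transaction (writeId 1) for MV creation,
# followed by ten small transactions (writeIds 2..11) of 100 rows each.
rows_per_write_id = {1: 1_000_000}
rows_per_write_id.update({wid: 100 for wid in range(2, 12)})
total_rows = sum(rows_per_write_id.values())

# The materialized view already contains everything up to writeId 1,
# so an incremental rebuild filters on writeId > 1.
threshold = 1

uniform = uniform_selectivity(1, 11, threshold)
actual = actual_selectivity(rows_per_write_id, threshold)

# Hypothetical stand-in for the config variable mentioned in the issue.
override_selectivity = 0.001

print("uniform estimate :", round(estimated_rows(total_rows, uniform)))
print("actual rows      :", round(estimated_rows(total_rows, actual)))
print("override estimate:", round(estimated_rows(total_rows, override_selectivity)))
```

With these made-up numbers, the uniform assumption estimates roughly 910,000 rows for the incremental plan, while only 1,000 rows actually match the filter. A fixed low selectivity gives an estimate close to the real figure, which is why a low setting tilts the cost comparison toward incremental rebuild.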