[
https://issues.apache.org/jira/browse/HIVE-24764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan reassigned HIVE-24764:
---------------------------------------
Assignee: Rajesh Balamohan
> insert overwrite on a partition resets row count stats in other partitions
> --------------------------------------------------------------------------
>
> Key: HIVE-24764
> URL: https://issues.apache.org/jira/browse/HIVE-24764
> Project: Hive
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> After insert overwrite on a partition, stats on other partitions are messed
> up. Subsequent queries end up with plans with PARTIAL stats. In certain
> cases, this leads to suboptimal query plans.
> {noformat}
> drop table if exists test_stats;
> drop table if exists test_stats_2;
> create table test_stats(i int, j bigint);
> create table test_stats_2(i int) partitioned by (j bigint);
> insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> -- select * from test_stats;
> 1 1
> 2 2
> 3 3
> 4 4
> 5 <null>
> insert overwrite table test_stats_2 partition(j) select i, j from test_stats
> where j is not null;
> -- After executing this statement, stat gets messed up.
> insert overwrite table test_stats_2 partition(j) select i, j from test_stats
> where j is null;
> -- select * from test_stats_2;
> 1 1
> 2 2
> 3 3
> 4 4
> 5 <null>
> -- This would return "PARTIAL" stats instead of "COMPLETE"
> explain select i, count(*) as c from test_stats_2 group by i order by c desc
> limit 10;
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
> DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: test_stats_2
> Statistics: Num rows: 125 Data size: 500 Basic stats:
> PARTIAL Column stats: COMPLETE
> Select Operator
> expressions: i (type: int)
> outputColumnNames: i
> Statistics: Num rows: 125 Data size: 500 Basic stats:
> PARTIAL Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> keys: i (type: int)
> minReductionHashAggr: 0.99
> mode: hash
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 125 Data size: 1500 Basic stats:
> PARTIAL Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 125 Data size: 1500 Basic
> stats: PARTIAL Column stats: COMPLETE
> value expressions: _col1 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: may be used (ACID table)
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL
> Column stats: COMPLETE
> Top N Key Operator
> sort order: -
> keys: _col1 (type: bigint)
> null sort order: a
> Statistics: Num rows: 62 Data size: 744 Basic stats:
> PARTIAL Column stats: COMPLETE
> top n: 10
> Reduce Output Operator
> key expressions: _col1 (type: bigint)
> null sort order: a
> sort order: -
> Statistics: Num rows: 62 Data size: 744 Basic stats:
> PARTIAL Column stats: COMPLETE
> TopN Hash Memory Usage: 0.04
> value expressions: _col0 (type: int)
> Reducer 3
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Select Operator
> expressions: VALUE._col0 (type: int), KEY.reducesinkkey0
> (type: bigint)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL
> Column stats: COMPLETE
> Limit
> Number of rows: 10
> Statistics: Num rows: 10 Data size: 120 Basic stats:
> PARTIAL Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 10 Data size: 120 Basic stats:
> PARTIAL Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: 10
> Processor Tree:
> ListSink
> {noformat}
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsAggregator.java#L138
> FSStats should return "null" or "" if partitions are not present in its list.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)