[ https://issues.apache.org/jira/browse/HIVE-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912264#comment-17912264 ]
Shohei Okumiya commented on HIVE-28702: --------------------------------------- I identified the root cause. https://github.com/apache/hive/pull/5607 > Currently, HiveIcebergStorageHandler#canProvideColStatistics and > HiveIcebergStorageHandler#getColStatistics retrieve statistics based on the > current snapshot id while HiveIcebergStorageHandler#getBasicStatistics > respects conditions of time travel features. It causes column statistics to > be inconsistent with basic stats. For example, column stats can say there are > 100 null values even though basic stats say the number of total rows is only > 10. Hive can get confused when it builds an execution plan. > Statistics are inconsistent on time travel queries > -------------------------------------------------- > > Key: HIVE-28702 > URL: https://issues.apache.org/jira/browse/HIVE-28702 > Project: Hive > Issue Type: Bug > Components: Iceberg integration, Statistics > Affects Versions: 4.0.1 > Reporter: Shohei Okumiya > Assignee: Shohei Okumiya > Priority: Major > Labels: pull-request-available > Attachments: image-2025-01-12-21-23-02-639.png > > > Time-travel queries using a snapshot id, timestamp, branching, or tagging can > run with incorrect statistics. > This set of queries reproduces the problem. > {code:java} > SET hive.fetch.task.conversion=none; > CREATE TABLE default.test (i1 INT, i2 INT) STORED BY ICEBERG; > INSERT INTO default.test VALUES (1, 11), (2, 22); > ALTER TABLE default.test CREATE TAG with_2_records; > EXPLAIN SELECT * FROM default.test.tag_with_2_records; > INSERT INTO default.test VALUES (null, null), (null, null), (null, null), > (null, null), (null, null), (null, null), (null, null), (null, null), (null, > null), (null, null), (null, null), (null, null); > EXPLAIN SELECT * FROM default.test.tag_with_2_records; {code} > The first EXPLAIN shows the correct size of statistics, with 2 records. > {code:java} > | Map 1 | > | Map Operator Tree: | > | TableScan | > | alias: test | > | Snapshot ref: tag_with_2_records | > | Statistics: Num rows: 2 Data size: 16 Basic stats: > COMPLETE Column stats: COMPLETE | > | Select Operator | > | expressions: i1 (type: int), i2 (type: int) | > | outputColumnNames: _col0, _col1 | > | Statistics: Num rows: 2 Data size: 16 Basic stats: > COMPLETE Column stats: COMPLETE | {code} > The size is broken after I ran the second INSERT query. > {code:java} > | Map 1 | > | Map Operator Tree: | > | TableScan | > | alias: test | > | Snapshot ref: tag_with_2_records | > | Statistics: Num rows: 2 Data size: 6610 Basic stats: > COMPLETE Column stats: COMPLETE | > | Select Operator | > | expressions: i1 (type: int), i2 (type: int) | > | outputColumnNames: _col0, _col1 | > | Statistics: Num rows: 2 Data size: -72 Basic stats: > COMPLETE Column stats: COMPLETE | {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)