[ 
https://issues.apache.org/jira/browse/HIVE-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912264#comment-17912264
 ] 

Shohei Okumiya commented on HIVE-28702:
---------------------------------------

I identified the root cause.

https://github.com/apache/hive/pull/5607

 

> Currently, HiveIcebergStorageHandler#canProvideColStatistics and 
> HiveIcebergStorageHandler#getColStatistics retrieve statistics based on the 
> current snapshot id while HiveIcebergStorageHandler#getBasicStatistics 
> respects conditions of time travel features. It causes column statistics to 
> be inconsistent with basic stats. For example, column stats can say there are 
> 100 null values even though basic stats say the number of total rows is only 
> 10. Hive can get confused when it builds an execution plan.

 

> Statistics are inconsistent on time travel queries
> --------------------------------------------------
>
>                 Key: HIVE-28702
>                 URL: https://issues.apache.org/jira/browse/HIVE-28702
>             Project: Hive
>          Issue Type: Bug
>          Components: Iceberg integration, Statistics
>    Affects Versions: 4.0.1
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-01-12-21-23-02-639.png
>
>
> Time-travel queries using a snapshot id, timestamp, branching, or tagging can 
> run with incorrect statistics.
> This set of queries reproduces the problem.
> {code:java}
> SET hive.fetch.task.conversion=none;
> CREATE TABLE default.test (i1 INT, i2 INT) STORED BY ICEBERG;
> INSERT INTO default.test VALUES (1, 11), (2, 22);
> ALTER TABLE default.test CREATE TAG with_2_records;
> EXPLAIN SELECT * FROM default.test.tag_with_2_records;
> INSERT INTO default.test VALUES (null, null), (null, null), (null, null), 
> (null, null), (null, null), (null, null), (null, null), (null, null), (null, 
> null), (null, null), (null, null), (null, null);
> EXPLAIN SELECT * FROM default.test.tag_with_2_records; {code}
> The first EXPLAIN shows the correct size of statistics, with 2 records.
> {code:java}
> |         Map 1                                      |
> |             Map Operator Tree:                     |
> |                 TableScan                          |
> |                   alias: test                      |
> |                   Snapshot ref: tag_with_2_records |
> |                   Statistics: Num rows: 2 Data size: 16 Basic stats: 
> COMPLETE Column stats: COMPLETE |
> |                   Select Operator                  |
> |                     expressions: i1 (type: int), i2 (type: int) |
> |                     outputColumnNames: _col0, _col1 |
> |                     Statistics: Num rows: 2 Data size: 16 Basic stats: 
> COMPLETE Column stats: COMPLETE | {code}
> The size is broken after I ran the second INSERT query.
> {code:java}
> |         Map 1                                      |
> |             Map Operator Tree:                     |
> |                 TableScan                          |
> |                   alias: test                      |
> |                   Snapshot ref: tag_with_2_records |
> |                   Statistics: Num rows: 2 Data size: 6610 Basic stats: 
> COMPLETE Column stats: COMPLETE |
> |                   Select Operator                  |
> |                     expressions: i1 (type: int), i2 (type: int) |
> |                     outputColumnNames: _col0, _col1 |
> |                     Statistics: Num rows: 2 Data size: -72 Basic stats: 
> COMPLETE Column stats: COMPLETE | {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to