[ 
https://issues.apache.org/jira/browse/HIVE-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912260#comment-17912260
 ] 

Shohei Okumiya commented on HIVE-28702:
---------------------------------------

I suspect the basic stats, i.e., the number of rows, are correctly retrieved 
from the referenced snapshot, but the column stats are retrieved from the 
current snapshot. In this case, the number of nulls is greater than the number 
of rows.

!image-2025-01-12-21-23-02-639.png!
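
As a rough sketch of how that mismatch could lead to the negative data size in 
the issue description: assume the estimator multiplies a per-column non-null 
count, computed as numRows - numNulls + 1 when nulls exist, by a 4-byte INT 
width. This is an assumed model for illustration, not code quoted from Hive, 
and the class and method names are hypothetical.

```java
public class DataSizeSketch {
    // Assumed width of an INT column in bytes.
    static final long INT_SIZE_BYTES = 4;

    // Hypothetical estimator (an assumed model, not Hive's actual code):
    // sums nonNullCount * width over all columns, with no guard against
    // numNulls exceeding numRows.
    static long estimateDataSize(long numRows, long[] numNullsPerColumn) {
        long total = 0;
        for (long numNulls : numNullsPerColumn) {
            long nonNullCount = numNulls > 0 ? numRows - numNulls + 1 : numRows;
            total += nonNullCount * INT_SIZE_BYTES;
        }
        return total;
    }

    public static void main(String[] args) {
        // Tagged snapshot: 2 rows, no nulls in i1 or i2.
        System.out.println(estimateDataSize(2, new long[] {0, 0}));   // prints 16
        // Row count from the tag, null counts from the current snapshot
        // (12 nulls per column after the second INSERT).
        System.out.println(estimateDataSize(2, new long[] {12, 12})); // prints -72
    }
}
```

Under these assumptions, the two results match the Select Operator data sizes 
in the two EXPLAIN outputs (16 and -72), which is consistent with the row count 
and the null counts coming from different snapshots.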

> Time travel queries calculate incorrect statistics
> --------------------------------------------------
>
>                 Key: HIVE-28702
>                 URL: https://issues.apache.org/jira/browse/HIVE-28702
>             Project: Hive
>          Issue Type: Bug
>          Components: Iceberg integration, Statistics
>    Affects Versions: 4.0.1
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>         Attachments: image-2025-01-12-21-23-02-639.png
>
>
> Time-travel queries using a snapshot ID, timestamp, branch, or tag can run 
> with incorrect statistics.
> The following queries reproduce the problem.
> {code:java}
> SET hive.fetch.task.conversion=none;
> CREATE TABLE default.test (i1 INT, i2 INT) STORED BY ICEBERG;
> INSERT INTO default.test VALUES (1, 11), (2, 22);
> ALTER TABLE default.test CREATE TAG with_2_records;
> EXPLAIN SELECT * FROM default.test.tag_with_2_records;
> INSERT INTO default.test VALUES (null, null), (null, null), (null, null), 
> (null, null), (null, null), (null, null), (null, null), (null, null), 
> (null, null), (null, null), (null, null), (null, null);
> EXPLAIN SELECT * FROM default.test.tag_with_2_records; {code}
> The first EXPLAIN shows correct statistics: 2 rows with a data size of 16.
> {code:java}
> |         Map 1                                      |
> |             Map Operator Tree:                     |
> |                 TableScan                          |
> |                   alias: test                      |
> |                   Snapshot ref: tag_with_2_records |
> |                   Statistics: Num rows: 2 Data size: 16 Basic stats: 
> COMPLETE Column stats: COMPLETE |
> |                   Select Operator                  |
> |                     expressions: i1 (type: int), i2 (type: int) |
> |                     outputColumnNames: _col0, _col1 |
> |                     Statistics: Num rows: 2 Data size: 16 Basic stats: 
> COMPLETE Column stats: COMPLETE | {code}
> After the second INSERT, the data size is wrong: 6610 at the TableScan and 
> -72 at the Select Operator, although the row count is still 2.
> {code:java}
> |         Map 1                                      |
> |             Map Operator Tree:                     |
> |                 TableScan                          |
> |                   alias: test                      |
> |                   Snapshot ref: tag_with_2_records |
> |                   Statistics: Num rows: 2 Data size: 6610 Basic stats: 
> COMPLETE Column stats: COMPLETE |
> |                   Select Operator                  |
> |                     expressions: i1 (type: int), i2 (type: int) |
> |                     outputColumnNames: _col0, _col1 |
> |                     Statistics: Num rows: 2 Data size: -72 Basic stats: 
> COMPLETE Column stats: COMPLETE | {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
