[jira] [Updated] (HIVE-28814) Optimize count(*) queries on Iceberg tables

Denys Kuzmenko (Jira) Tue, 11 Mar 2025 07:57:49 -0700


     [ https://issues.apache.org/jira/browse/HIVE-28814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Denys Kuzmenko updated HIVE-28814:
----------------------------------
    Component/s: Iceberg integration

> Optimize count(*) queries on Iceberg tables
> -------------------------------------------
>
>                 Key: HIVE-28814
>                 URL: https://issues.apache.org/jira/browse/HIVE-28814
>             Project: Hive
>          Issue Type: Improvement
>          Components: Iceberg integration, Statistics
>    Affects Versions: 4.0.1
>            Reporter: Denys Kuzmenko
>            Priority: Major
>
> A simple {{SELECT count( * ) FROM tbl_ice;}} query could be optimized.
> 1. If a V2 table doesn't have any delete files, then the row count is
> {noformat}
> Cardinality(data files){noformat}
> 2. Otherwise (the table has delete files), we can still optimize count( * ) queries with the following plan:
> {noformat}
>         SUM
>          |
>      UNION ALL
>       /     \
>      /       \
>     /         \
> COUNT(*)     COUNT(*)
>   /                \
> SCAN             ANTI JOIN
> data files         /      \
> without           /        \
> deletes       SCAN         SCAN
>               data files   delete files
>               with deletes
> {noformat}
> The SCAN operator over "data files without deletes" could benefit from the
> count( * ) optimization, since it would only need to read file metadata. In the
> common case (when there are few deletes), this SCAN covers the vast majority of
> data files.
> ref: [https://issues.apache.org/jira/browse/IMPALA-11802]
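
The plan in the description can be sketched as a small in-memory model. This is only an illustration of the idea, not Hive's or Iceberg's implementation: the names DataFile, record_count and count_star are hypothetical, only position deletes are modeled (equality deletes are ignored), and a delete is assumed to be a (data file path, row position) pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""
    path: str
    record_count: int  # assumed to be available from metadata, without a row scan


def count_star(data_files, position_deletes):
    """Count live rows; position_deletes is a set of (path, row position) pairs."""
    files_with_deletes = {path for path, _ in position_deletes}

    # Left branch: files untouched by deletes are counted from metadata only.
    metadata_count = sum(
        f.record_count for f in data_files if f.path not in files_with_deletes
    )

    # Right branch: scan the affected files and ANTI JOIN against the deletes.
    scanned_count = sum(
        1
        for f in data_files
        if f.path in files_with_deletes
        for pos in range(f.record_count)
        if (f.path, pos) not in position_deletes
    )

    # SUM over the UNION ALL of the two partial counts.
    return metadata_count + scanned_count


files = [
    DataFile("a.parquet", 100),
    DataFile("b.parquet", 50),
    DataFile("c.parquet", 10),
]
deletes = {("c.parquet", 0), ("c.parquet", 3)}

print(count_star(files, deletes))  # only c.parquet is scanned row by row
print(count_star(files, set()))    # case 1: no delete files, metadata only
```

With no deletes the right branch is empty and the result reduces to case 1, the sum of the per-file record counts. With few deletes, only the small set of affected files is scanned.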



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
