[ https://issues.apache.org/jira/browse/HIVE-28814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denys Kuzmenko updated HIVE-28814:
----------------------------------
Component/s: Iceberg integration
> Optimize count(*) queries on Iceberg tables
> -------------------------------------------
>
> Key: HIVE-28814
> URL: https://issues.apache.org/jira/browse/HIVE-28814
> Project: Hive
> Issue Type: Improvement
> Components: Iceberg integration, Statistics
> Affects Versions: 4.0.1
> Reporter: Denys Kuzmenko
> Priority: Major
>
> Simple {{SELECT count( * ) FROM tbl_ice;}} could be optimized.
> 1. If a V2 table doesn't have any delete files, the row count can be taken
> from data file metadata alone:
> {noformat}
> Cardinality(data files){noformat}
> 2. If the table does have delete files, we can still optimize count( * )
> queries by splitting the plan:
> {noformat}
>                  SUM
>                   |
>               UNION ALL
>                /     \
>               /       \
>              /         \
>       COUNT(*)         COUNT(*)
>          /                 \
>        SCAN             ANTI JOIN
>     data files           /     \
>      without            /       \
>      deletes          SCAN     SCAN
>                    data files  delete files
>                   with deletes
> {noformat}
> The SCAN operator over "data files without deletes" could benefit from the
> count( * ) optimization, since it would only need to read file metadata. In
> the common case (when there are few deletes), this SCAN covers the vast
> majority of data files.
> ref: [https://issues.apache.org/jira/browse/IMPALA-11802]
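
The two cases in the description can be sketched in a few lines of Python. This is only an illustration of the counting logic, not Hive or Iceberg code: the `DataFile` class, the `count_star` function, and the representation of deletes as (data file path, row position) pairs are all assumptions made for the example, and equality deletes are not modeled.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    # Hypothetical stand-in for a data file entry in table metadata.
    path: str
    record_count: int  # row count known from metadata, no data read needed


def count_star(data_files, position_deletes):
    """Count live rows, given deletes as (data file path, row position) pairs."""
    # Group deleted positions by the data file they apply to.
    deleted = {}
    for path, pos in position_deletes:
        deleted.setdefault(path, set()).add(pos)

    # Left branch: files without deletes are counted from metadata only.
    metadata_count = sum(
        f.record_count for f in data_files if f.path not in deleted
    )

    # Right branch: files with deletes are "scanned" and anti-joined
    # against the deleted positions.
    scanned_count = 0
    for f in data_files:
        if f.path in deleted:
            scanned_count += sum(
                1 for pos in range(f.record_count) if pos not in deleted[f.path]
            )

    # SUM over the UNION ALL of both branches.
    return metadata_count + scanned_count


files = [
    DataFile("a.parquet", 100),
    DataFile("b.parquet", 50),
    DataFile("c.parquet", 10),
]
deletes = [("c.parquet", 0), ("c.parquet", 3)]

total_without_deletes = count_star(files, [])    # case 1: metadata only
total_with_deletes = count_star(files, deletes)  # case 2: split plan
```

In this sketch only `c.parquet` needs a row-level pass. The other two files contribute their metadata counts directly, which is the saving the proposal is after when few files carry deletes.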