[
https://issues.apache.org/jira/browse/HIVE-28814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko updated HIVE-28814:
----------------------------------
Description:
A simple SELECT count(*) FROM tbl_ice; query could be optimized.
1. If a V2 table doesn't have any delete files, then the row count is simply
Cardinality(data files), which can be read from file metadata.
2. Otherwise, we can still optimize count(*) queries with the following plan:
              SUM
               |
           UNION ALL
           /       \
          /         \
         /           \
    COUNT(*)       COUNT(*)
       /               \
     SCAN           ANTI JOIN
   data files        /     \
   without          /       \
   deletes        SCAN     SCAN
              data files   delete files
              with deletes
The SCAN operator over "data files without deletes" could benefit from the
count(*) optimization (it would only need to read file metadata). In the common
case (when there are few deletes), this SCAN is responsible for the vast
majority of data files.
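
The plan above can be illustrated with a minimal Python sketch. This is not Hive or Iceberg code: DataFile, count_star, and the (path, position) representation of deletes are hypothetical names chosen for illustration, and only position deletes are modeled (equality deletes are ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""
    path: str
    record_count: int  # row count as recorded in file metadata


def count_star(data_files, position_deletes):
    """Count live rows; position_deletes is a set of (path, row_position) pairs."""
    files_with_deletes = {path for path, _ in position_deletes}

    # Left branch: files with no deletes need only their metadata row counts.
    metadata_count = sum(
        f.record_count for f in data_files if f.path not in files_with_deletes
    )

    # Right branch: files with deletes are scanned, and rows that match a
    # delete entry are dropped (the ANTI JOIN in the plan).
    scanned_count = sum(
        1
        for f in data_files
        if f.path in files_with_deletes
        for pos in range(f.record_count)
        if (f.path, pos) not in position_deletes
    )

    # SUM over the UNION ALL of both branches.
    return metadata_count + scanned_count


files = [
    DataFile("a.parquet", 100),
    DataFile("b.parquet", 50),
    DataFile("c.parquet", 10),
]
deletes = {("c.parquet", 0), ("c.parquet", 3)}
print(count_star(files, deletes))  # 158: 150 from metadata, 8 from the scanned file
```

With an empty delete set the right branch does no work, which corresponds to case 1.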
ref: https://issues.apache.org/jira/browse/IMPALA-11802
> Optimize count(*) queries on Iceberg tables
> -------------------------------------------
>
> Key: HIVE-28814
> URL: https://issues.apache.org/jira/browse/HIVE-28814
> Project: Hive
> Issue Type: Test
> Reporter: Denys Kuzmenko
> Priority: Major