[
https://issues.apache.org/jira/browse/IMPALA-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Kaszab updated IMPALA-11802:
----------------------------------
Parent: IMPALA-12087
Issue Type: Sub-task (was: Bug)
> Optimize count(*) queries for Iceberg V2 tables
> -----------------------------------------------
>
> Key: IMPALA-11802
> URL: https://issues.apache.org/jira/browse/IMPALA-11802
> Project: IMPALA
> Issue Type: Sub-task
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: Li Penglin
> Priority: Major
> Labels: impala-iceberg
> Fix For: Impala 4.3.0
>
>
> A simple {{SELECT count( * ) FROM ice_v2_tbl;}} query could be optimized.
> First, we need to investigate whether the following holds:
> If a V2 table has only position delete files, then the table's cardinality is
> {noformat}
> Cardinality(data files) - Cardinality(delete files)
> {noformat}
> If this is true, then we can answer count( * ) queries via a query rewrite,
> similar to what we do for V1 tables (IMPALA-11279).
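
To make the arithmetic behind the proposed rewrite concrete, here is a minimal Python sketch. It is not Impala code: `FileMeta`, `count_star_from_metadata`, the file names, and the record counts are all hypothetical, and the sketch assumes the very property the ticket wants to investigate, namely that each position delete record removes exactly one distinct live row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for the per-file metadata kept for a table."""
    path: str
    record_count: int


def count_star_from_metadata(data_files, position_delete_files):
    """Cardinality(data files) - Cardinality(delete files).

    Only correct under the assumption being investigated: every position
    delete record refers to one distinct row that is still present.
    """
    data_rows = sum(f.record_count for f in data_files)
    deleted_rows = sum(f.record_count for f in position_delete_files)
    return data_rows - deleted_rows


# Made-up example values, for illustration only.
data_files = [FileMeta("data-1.parquet", 1000), FileMeta("data-2.parquet", 500)]
delete_files = [FileMeta("delete-1.parquet", 30)]
row_count = count_star_from_metadata(data_files, delete_files)
print(row_count)
```

If two delete files could record the same position, the subtraction would undercount, which is why the assumption has to be verified before relying on metadata alone.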
> If the formula above does not hold, we can still optimize count( * ) queries
> with the following plan:
> {noformat}
>               SUM
>                |
>            UNION ALL
>            /       \
>           /         \
>          /           \
>     COUNT(*)       COUNT(*)
>        /               \
>     SCAN            ANTI JOIN
>  data files          /     \
>  without            /       \
>  deletes          SCAN      SCAN
>                data files  delete files
>                with deletes
> {noformat}
> The SCAN operator over "data files without deletes" could benefit from the
> count( * ) optimization, since it would only need to read file metadata. In
> the common case (when there are few deletes), this SCAN is responsible for
> the vast majority of data files.
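
The plan above can be simulated in a few lines of Python to show how the two branches combine. This is only a sketch under stated assumptions, not the planner's implementation: `count_star_split_plan` and its inputs are hypothetical, data files are modeled as a mapping from path to record count, and position deletes as `(path, position)` pairs.

```python
def count_star_split_plan(data_file_row_counts, position_deletes):
    """Simulate SUM(UNION ALL(COUNT(*), COUNT(*))) from the plan above.

    data_file_row_counts: dict of data file path -> record count (metadata).
    position_deletes: iterable of (data file path, row position) pairs.
    """
    # A set tolerates the same position being recorded more than once.
    deleted = set(position_deletes)
    files_with_deletes = {path for path, _pos in deleted}

    metadata_count = 0   # left branch: files without deletes, metadata only
    anti_join_count = 0  # right branch: files with deletes, row-level anti join
    for path, row_count in data_file_row_counts.items():
        if path not in files_with_deletes:
            metadata_count += row_count
        else:
            anti_join_count += sum(
                1 for pos in range(row_count) if (path, pos) not in deleted
            )
    return metadata_count + anti_join_count


# Made-up example: only c.parquet has deletes, and one of them is duplicated.
row_counts = {"a.parquet": 1000, "b.parquet": 500, "c.parquet": 10}
deletes = [("c.parquet", 0), ("c.parquet", 3), ("c.parquet", 3)]
total = count_star_split_plan(row_counts, deletes)
print(total)
```

Only `c.parquet` is read row by row here. The other two files contribute their counts straight from metadata, which is where the saving comes from when deletes are rare.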
--
This message was sent by Atlassian Jira
(v8.20.10#820010)