[jira] [Updated] (HIVE-28814) Optimize count(*) queries on Iceberg tables

Denys Kuzmenko (Jira) Tue, 11 Mar 2025 01:53:06 -0700


     [ https://issues.apache.org/jira/browse/HIVE-28814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Denys Kuzmenko updated HIVE-28814:
----------------------------------
    Description: 
A simple {{SELECT count(*) FROM tbl_ice;}} query could be optimized.

1. If a V2 table doesn't have any delete files, then the cardinality is simply
the combined row count of its data files:
{noformat}
Cardinality(data files)
{noformat}
2. Otherwise, count(*) queries can still be optimized with the following plan:
{noformat}
        SUM
         |
     UNION ALL
      /     \
     /       \
    /         \
COUNT(*)     COUNT(*)
  /                \
SCAN             ANTI JOIN
data files         /      \
without           /        \
deletes       SCAN         SCAN
              data files   delete files
              with deletes
{noformat}
The SCAN operator over "data files without deletes" can benefit from the
count(*) optimization, since it only needs to read file metadata. In the common
case, when there are few deletes, this SCAN covers the vast majority of data
files.
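
The plan above can be modelled in a few lines of Python. This is only an illustrative sketch, not Hive's implementation: the names {{DataFile}}, {{record_count}} and {{count_star}} are hypothetical, and it assumes that deletes are position deletes given as (file path, row position) pairs. Files that no delete refers to are counted from their metadata alone; only files with deletes are scanned and anti-joined against the delete records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str          # hypothetical file identifier
    record_count: int  # row count as recorded in the file's metadata


def count_star(data_files, position_deletes):
    """Return (row_count, rows_scanned) for a table.

    position_deletes is an iterable of (file_path, row_position) pairs
    (assumption: position deletes only, no equality deletes).
    """
    deleted = set(position_deletes)  # duplicates collapse to one delete
    files_with_deletes = {path for path, _ in deleted}

    metadata_count = 0  # left branch: COUNT(*) answered from metadata
    scanned_count = 0   # right branch: COUNT(*) over the anti join
    rows_scanned = 0    # rows that actually had to be read

    for f in data_files:
        if f.path not in files_with_deletes:
            metadata_count += f.record_count
        else:
            rows_scanned += f.record_count
            # Anti join: keep row positions with no matching delete record.
            scanned_count += sum(
                1 for pos in range(f.record_count)
                if (f.path, pos) not in deleted
            )

    # SUM over the UNION ALL of both branches.
    return metadata_count + scanned_count, rows_scanned


files = [
    DataFile("a.parquet", 1000),
    DataFile("b.parquet", 500),
    DataFile("c.parquet", 10),
]
deletes = [("c.parquet", 0), ("c.parquet", 3), ("c.parquet", 3)]

total, scanned = count_star(files, deletes)
print(total, scanned)  # 1508 10
```

In this example only the 10 rows of the single file with deletes are read; the other 1500 rows are counted from metadata. With an empty delete list the function degenerates to case 1 and reads no rows at all.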

ref: [https://issues.apache.org/jira/browse/IMPALA-11802]



> Optimize count(*) queries on Iceberg tables
> -------------------------------------------
>
>                 Key: HIVE-28814
>                 URL: https://issues.apache.org/jira/browse/HIVE-28814
>             Project: Hive
>          Issue Type: Test
>            Reporter: Denys Kuzmenko
>            Priority: Major



