[jira] [Updated] (IMPALA-11802) Optimize count(*) queries for Iceberg V2 tables

Jira Thu, 15 Dec 2022 07:59:09 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zoltán Borók-Nagy updated IMPALA-11802:
---------------------------------------
    Description: 
Simple {{SELECT count( * ) FROM ice_v2_tbl;}} could be optimized.

First we need to investigate whether the following is true:
If a V2 table has only position delete files, then the cardinality is
{noformat}
Cardinality(data files) - Cardinality(delete files)
{noformat}
If this is true, we can answer count( * ) queries via a query rewrite, similar 
to what we do for V1 tables (IMPALA-11279).
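
As a rough illustration of the subtraction above, here is a minimal Python sketch. It is not Impala code: FileMeta, record_count and count_star_from_metadata are hypothetical names, and the result is only correct under the assumption this ticket sets out to verify, namely that every position delete record removes exactly one distinct, existing row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for per-file metadata kept by the table format."""
    path: str
    record_count: int  # number of records stored in the file


def count_star_from_metadata(data_files, position_delete_files):
    """Cardinality(data files) - Cardinality(delete files).

    Assumption (to be verified): no equality deletes, and no duplicate
    or dangling position delete records.
    """
    data_rows = sum(f.record_count for f in data_files)
    deleted_rows = sum(f.record_count for f in position_delete_files)
    return data_rows - deleted_rows


data_files = [FileMeta("d1.parquet", 100), FileMeta("d2.parquet", 50)]
delete_files = [FileMeta("del1.parquet", 7)]
result = count_star_from_metadata(data_files, delete_files)
print(result)
```

If the same row position can appear in more than one delete record, this subtraction undercounts, which is why the assumption has to be checked first.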

If the above is not true, we can still optimize count( * ) queries with the following plan:
{noformat}
        SUM
         |
     UNION ALL
      /     \
     /       \
    /         \
COUNT(*)     COUNT(*)
  /                \
SCAN             ANTI JOIN
data files         /      \
without           /        \
deletes       SCAN         SCAN
              data files   delete files
              with deletes
{noformat}

The SCAN operator over "data files without deletes" could benefit from the 
count( * ) optimization (it would only need to read file metadata). In the 
common case (when there are few deletes), this SCAN is responsible for the vast 
majority of data files.
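
The plan above can be simulated in a few lines of Python. This is only a sketch under stated assumptions, not the planner implementation: count_star_split is a hypothetical helper, data files are modeled as a dict from path to row count, and position deletes as (path, pos) pairs.

```python
def count_star_split(data_files, deletes):
    """Simulate SUM over UNION ALL of the two COUNT(*) branches.

    data_files: dict mapping data file path -> row count (from file metadata)
    deletes:    iterable of (path, pos) position delete records
    """
    deleted = set(deletes)  # duplicate delete records collapse here
    paths_with_deletes = {path for path, _ in deleted}

    # Left branch: data files without deletes, answered from metadata only.
    left = sum(n for path, n in data_files.items()
               if path not in paths_with_deletes)

    # Right branch: ANTI JOIN of data rows against delete records,
    # only for the data files that actually have deletes.
    right = sum(
        1
        for path in paths_with_deletes
        if path in data_files  # ignore deletes pointing at unknown files
        for pos in range(data_files[path])
        if (path, pos) not in deleted
    )
    return left + right


files = {"a.parquet": 1000, "b.parquet": 1000, "c.parquet": 10}
# Position 3 of c.parquet is deleted twice, e.g. by two delete files.
dels = [("c.parquet", 0), ("c.parquet", 3), ("c.parquet", 3)]
total = count_star_split(files, dels)
print(total)
```

In this example only c.parquet is read row by row; the other two files contribute their counts from metadata. The duplicate delete record does not affect the result, whereas plain subtraction would be off by one.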

  was:
Simple {{SELECT count( * ) FROM ice_v2_tbl;}} could be optimized.

At first we need to investigate if the following is true:
If a V2 table only has position delete files, then the cardinality is
{noformat}
Cardinality(data files) - Cardinality(delete files)
{noformat}
If this is true, we answer count( * ) queries via a query rewrite similarly to 
what we do for V1 tables: IMPALA-11279

If the above is not true, we can still optimize count( * ) queries by:
{noformat}
        SUM
         |
       UNION
      /     \
     /       \
    /         \
COUNT(*)     COUNT(*)
  /                \
SCAN             ANTI JOIN
data files         /      \
without           /        \
deletes       SCAN         SCAN
              data files   delete files
              with deletes
{noformat}

The SCAN operator with "data files without deletes" could benefit from count( * 
) optimization (they would only need to read file metadata). In the common case 
(when there are few deletes) this SCAN is in charge of scanning the vast 
majority of data files.


> Optimize count(*) queries for Iceberg V2 tables
> -----------------------------------------------
>
>                 Key: IMPALA-11802
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11802
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Simple {{SELECT count( * ) FROM ice_v2_tbl;}} could be optimized.
> First we need to investigate whether the following is true:
> If a V2 table has only position delete files, then the cardinality is
> {noformat}
> Cardinality(data files) - Cardinality(delete files)
> {noformat}
> If this is true, we can answer count( * ) queries via a query rewrite, similar 
> to what we do for V1 tables (IMPALA-11279).
> If the above is not true, we can still optimize count( * ) queries by:
> {noformat}
>         SUM
>          |
>      UNION ALL
>       /     \
>      /       \
>     /         \
> COUNT(*)     COUNT(*)
>   /                \
> SCAN             ANTI JOIN
> data files         /      \
> without           /        \
> deletes       SCAN         SCAN
>               data files   delete files
>               with deletes
> {noformat}
> The SCAN operator over "data files without deletes" could benefit from the 
> count( * ) optimization (it would only need to read file metadata). In the 
> common case (when there are few deletes), this SCAN is responsible for the 
> vast majority of data files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
