[jira] [Created] (IMPALA-11986) Optimize MIN(part_col)/ MAX(part_col)/ COUNT(DISTINCT part_col)/ queries for Iceberg tables

Li Penglin (Jira) Wed, 08 Mar 2023 01:11:36 -0800

Li Penglin created IMPALA-11986:
-----------------------------------

             Summary: Optimize MIN(part_col)/ MAX(part_col)/ COUNT(DISTINCT 
part_col)/ queries for Iceberg tables
                 Key: IMPALA-11986
                 URL: https://issues.apache.org/jira/browse/IMPALA-11986
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Li Penglin



For Iceberg V1 tables and V2 tables without deletes:
https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html
OPTIMIZE_PARTITION_KEY_SCANS optimizes MIN(key_column), MAX(key_column), and 
COUNT(DISTINCT key_column) queries by using the 'TBLS' table and the 
'PARTITION_KEY_VALS' partition key values in the HMS metadata. For Iceberg 
tables, partitioning stats are not stored in the HMS, but they can be obtained 
through the Iceberg API. We can optimize MIN(key_column), MAX(key_column), and 
COUNT(DISTINCT key_column) queries in a similar way, but only when the 
column's partition transform is 'identity'. With any other transform (for 
example, bucket or truncate), the partition value is derived from the column 
value rather than equal to it, so it cannot answer these aggregates directly.
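
To make the identity-transform condition concrete, here is a minimal Python sketch of the idea. It is not Impala or Iceberg code: `DataFileEntry` and `partition_key_aggregates` are hypothetical names, and the sketch assumes that the per-file partition values and record counts have already been read from Iceberg metadata. It also assumes that files with zero records should not contribute values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for one data file entry in Iceberg metadata."""
    partition_value: object  # value of the partition field for this file
    record_count: int        # number of rows stored in this file


def partition_key_aggregates(transform: str, files: list) -> Optional[dict]:
    """Answer MIN/MAX/COUNT(DISTINCT) from partition values only.

    Returns None when the metadata cannot answer the query, which signals
    that the caller should fall back to a regular scan.
    """
    # Only an identity transform keeps partition value == column value.
    if transform != "identity":
        return None

    # Assumption: skip empty files, and skip NULLs because MIN, MAX, and
    # COUNT(DISTINCT) all ignore NULL values.
    values = {
        f.partition_value
        for f in files
        if f.record_count > 0 and f.partition_value is not None
    }
    if not values:
        return {"min": None, "max": None, "count_distinct": 0}
    return {
        "min": min(values),
        "max": max(values),
        "count_distinct": len(values),
    }
```

Returning None for non-identity transforms keeps the fast path strictly opt-in: any case the metadata cannot answer exactly is left to the normal scan.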
For non-partitioned columns, if min/max information is stored in the Iceberg 
metadata, could MIN(column) and MAX(column) queries also be optimized in the 
same way? However, Impala does not guarantee that the statistics for these 
non-partitioned columns are complete, which complicates this case.
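
One way to deal with incomplete statistics is to use the metadata only when every data file reports bounds for the column, and to fall back to a scan otherwise. The Python sketch below illustrates that rule. The function name `min_max_from_file_stats` and its input shape are hypothetical, and it assumes the stored bounds are exact column values, which may not hold if bounds are truncated or otherwise approximate.

```python
from typing import Optional


def min_max_from_file_stats(file_stats: list) -> Optional[tuple]:
    """Combine per-file (lower, upper) bounds into a table-wide (min, max).

    file_stats holds one entry per data file: a (lower, upper) tuple, or
    None when that file has no bounds recorded for the column.
    Returns None when the metadata is not sufficient, meaning the caller
    should fall back to a regular scan.
    """
    # No files: nothing to derive from metadata in this sketch.
    if not file_stats:
        return None

    # A single file without bounds makes the result unreliable.
    if any(stats is None for stats in file_stats):
        return None

    lowest = min(lower for lower, _ in file_stats)
    highest = max(upper for _, upper in file_stats)
    return (lowest, highest)
```

For example, `min_max_from_file_stats([(1, 5), None])` returns None because one file has no bounds, so the query would have to scan the data.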



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
