Li Penglin created IMPALA-11986:
-----------------------------------
Summary: Optimize MIN(part_col), MAX(part_col), and COUNT(DISTINCT
part_col) queries for Iceberg tables
Key: IMPALA-11986
URL: https://issues.apache.org/jira/browse/IMPALA-11986
Project: IMPALA
Issue Type: Improvement
Reporter: Li Penglin
For Iceberg V1 and V2 tables without deletes:
https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html
OPTIMIZE_PARTITION_KEY_SCANS optimizes MIN(key_column), MAX(key_column), and
COUNT(DISTINCT key_column) queries by using the 'TBLS' table and the
'PARTITION_KEY_VALS' partition key values in the HMS metadata. For Iceberg
tables, the partitioning information is not stored in the HMS, but it can be
obtained through the Iceberg API. We can optimize MIN(key_column),
MAX(key_column), and COUNT(DISTINCT key_column) queries with a similar idea,
but only when the partition transform of the column is 'identity'. Other
transforms (for example bucket or truncate) do not preserve the original
column values, so the partition values cannot answer these queries.
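
A minimal sketch of the idea above, in Python for brevity. It is not Impala
code and does not use the Iceberg API: PartitionField,
aggregates_from_partitions, and the plain dicts of partition values are
hypothetical stand-ins for whatever the Iceberg metadata provides. The
function returns None whenever the metadata-only answer would be unsafe, in
which case the planner would presumably fall back to a normal scan.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class PartitionField:
    """Hypothetical stand-in for one field of an Iceberg partition spec."""
    source_column: str
    transform: str  # e.g. "identity", "bucket[16]", "day"


def aggregates_from_partitions(
    column: str,
    spec: List[PartitionField],
    partition_values: List[Dict[str, Any]],
    has_delete_files: bool,
) -> Optional[Dict[str, Any]]:
    """Answer MIN/MAX/COUNT(DISTINCT) for `column` from partition values.

    Returns None when the shortcut does not apply: the table has delete
    files, or `column` is not an identity-transformed partition column.
    """
    if has_delete_files:
        return None
    is_identity = any(
        f.source_column == column and f.transform == "identity" for f in spec
    )
    if not is_identity:
        return None
    # SQL aggregates ignore NULLs, so NULL partition values are dropped.
    values = {p[column] for p in partition_values if p[column] is not None}
    if not values:
        return {"min": None, "max": None, "count_distinct": 0}
    return {"min": min(values), "max": max(values), "count_distinct": len(values)}


spec = [PartitionField("year", "identity"), PartitionField("id", "bucket[16]")]
parts = [{"year": 2021, "id": 3}, {"year": 2023, "id": 1}, {"year": 2021, "id": 7}]
by_year = aggregates_from_partitions("year", spec, parts, has_delete_files=False)
by_id = aggregates_from_partitions("id", spec, parts, has_delete_files=False)
```

Here by_year carries the three aggregates for the identity-partitioned
column, while by_id is None because 'id' is bucketed.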
For non-partition columns, if min/max information is stored in the Iceberg
metadata, could MIN(column) and MAX(column) queries also be optimized based
on this idea?
However, Impala does not guarantee that the statistics for these
non-partition columns are complete, which makes this case less clear.
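
To make the completeness concern concrete, here is a small sketch under the
same caveats as before: FileColumnBounds and min_max_from_file_stats are
hypothetical names, not Impala or Iceberg APIs. It only trusts per-file
bounds when every data file has both of them. Note also that Iceberg may
store truncated bounds for long values such as strings, so even complete
bounds are not always the exact minimum and maximum.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class FileColumnBounds:
    """Hypothetical per-data-file bounds for one column; None means missing."""
    lower: Optional[Any]
    upper: Optional[Any]


def min_max_from_file_stats(
    files: List[FileColumnBounds],
) -> Optional[Tuple[Any, Any]]:
    """Combine per-file bounds into a table-level (min, max).

    Returns None if there are no files or if any file lacks a bound, since
    the missing file could hold a value outside the known range.
    """
    if not files:
        return None
    if any(f.lower is None or f.upper is None for f in files):
        return None
    return min(f.lower for f in files), max(f.upper for f in files)


complete = [FileColumnBounds(5, 40), FileColumnBounds(1, 17)]
incomplete = complete + [FileColumnBounds(None, None)]
safe_result = min_max_from_file_stats(complete)
unsafe_result = min_max_from_file_stats(incomplete)
```

With complete bounds the answer comes from metadata alone; a single file
without bounds forces a fallback (None here).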