[
https://issues.apache.org/jira/browse/IMPALA-11986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045794#comment-18045794
]
Noémi Pap-Takács commented on IMPALA-11986:
-------------------------------------------
This could be built on the logic of the partition table introduced in
IMPALA-13267, once the changes expected in IMPALA-14564 are in.
> Optimize MIN(part_col)/ MAX(part_col)/ COUNT(DISTINCT part_col)/ queries for
> Iceberg tables
> -------------------------------------------------------------------------------------------
>
> Key: IMPALA-11986
> URL: https://issues.apache.org/jira/browse/IMPALA-11986
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Li Penglin
> Assignee: Noémi Pap-Takács
> Priority: Major
> Labels: impala-iceberg, performance
>
> For Iceberg V1 and V2 tables without deletes:
> [https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html]
> OPTIMIZE_PARTITION_KEY_SCANS optimizes MIN(key_column), MAX(key_column),
> and COUNT(DISTINCT key_column) using the 'TBLS' table and the
> 'PARTITION_KEY_VALS' partition key column in the HMS metadata. For Iceberg
> tables, partitioning stats are not stored in the HMS, but they can be
> obtained through the Iceberg API. We can optimize MIN(key_column),
> MAX(key_column), and COUNT(DISTINCT key_column) with a similar idea, but we
> must make sure that the partition transform is 'identity'.
> For non-partitioned columns, if min-max information is stored in the Iceberg
> metadata, could MIN(column) and MAX(column) queries also be optimized in the
> same way?
> However, Impala does not guarantee that the statistics for these
> non-partitioned columns are complete, which makes this case unclear.