[ https://issues.apache.org/jira/browse/IMPALA-11986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060759#comment-18060759 ]
ASF subversion and git services commented on IMPALA-11986:
----------------------------------------------------------
Commit bf7c2088dd5495a763ff9a381970f99e6101cd4b in impala's branch
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bf7c2088d ]
IMPALA-11986: (part 1) Optimize partition key scans for Iceberg tables
This patch optimizes queries that scan only IDENTITY-partitioned
columns. The optimization applies only if:
* All materialized aggregate expressions have distinct semantics
(e.g. MIN, MAX, NDV). In other words, this optimization will work
for COUNT(DISTINCT c) but not COUNT(c).
* All materialized columns are IDENTITY-partitioned in all partition
specs (this can be relaxed later)
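
The two conditions above can be sketched as a small eligibility check.
This is an illustrative Python sketch, not Impala's planner code:
AggExpr, DISTINCT_SAFE_AGGS, can_optimize, and the dict-based partition
spec representation are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass

# Assumption: aggregates whose result does not change when duplicate
# input rows are dropped (the "distinct semantics" from the text).
DISTINCT_SAFE_AGGS = {"MIN", "MAX", "NDV"}


@dataclass(frozen=True)
class AggExpr:
    func: str               # e.g. "MIN", "COUNT"
    column: str
    distinct: bool = False  # True for COUNT(DISTINCT c)


def has_distinct_semantics(agg):
    """True if duplicates cannot change the aggregate's result."""
    return agg.distinct or agg.func.upper() in DISTINCT_SAFE_AGGS


def can_optimize(aggs, materialized_cols, partition_specs):
    """partition_specs: one dict per spec, mapping column -> transform name."""
    if not partition_specs:
        return False
    if not all(has_distinct_semantics(a) for a in aggs):
        return False
    # Every materialized column must be IDENTITY-partitioned in every spec.
    return all(
        spec.get(col) == "IDENTITY"
        for spec in partition_specs
        for col in materialized_cols
    )
```

With this sketch, MIN(p) and COUNT(DISTINCT p) qualify when p is
IDENTITY-partitioned in every spec, while COUNT(p) does not.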
If the above conditions are met, then each data file without deletes
produces only a single record. The rest of the table (data files with
deletes, and the delete files themselves) is scanned normally.
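
The split between the two kinds of data files can be sketched as
follows. This is a simplified Python model under stated assumptions, not
the actual implementation: DataFile and plan_scan are hypothetical
names, and the partition values are assumed to be available from file
metadata as a plain dict.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    partition: dict   # identity partition values, e.g. {"p": 3}
    delete_files: list = field(default_factory=list)


def plan_scan(files):
    """Split files into metadata-only records and files needing a normal scan."""
    metadata_rows, normal_scan = [], []
    for f in files:
        if f.delete_files:
            # Deletes may remove every row of some partition value,
            # so the file content has to be read.
            normal_scan.append(f)
        else:
            # All rows in the file share the same partition values, so one
            # record is enough for aggregates with distinct semantics.
            metadata_rows.append(dict(f.partition))
    return metadata_rows, normal_scan
```

The aggregation then runs over the metadata-only records plus the rows
produced by the normally scanned files.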
Testing:
* added e2e tests
Change-Id: I32f78ee60ac4a410e91cf0e858199dd39d2e9afe
Reviewed-on: http://gerrit.cloudera.org:8080/23985
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Optimize MIN(part_col), MAX(part_col), and COUNT(DISTINCT part_col) queries for
> Iceberg tables
> -------------------------------------------------------------------------------------------
>
> Key: IMPALA-11986
> URL: https://issues.apache.org/jira/browse/IMPALA-11986
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Li Penglin
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg, performance
>
> For Iceberg V1 and V2 tables without deletes:
> [https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html]
> OPTIMIZE_PARTITION_KEY_SCANS optimizes MIN(key_column), MAX(key_column),
> and COUNT(DISTINCT key_column) queries by using the 'TBLS' table and the
> 'PARTITION_KEY_VALS' partition key column in the HMS metadata. For Iceberg
> tables, partition stats are not stored in the HMS, but they can be obtained
> through the Iceberg API. We can optimize MIN(key_column), MAX(key_column),
> and COUNT(DISTINCT key_column) queries with a similar idea, but we must
> make sure that the partition transform is 'identity'.
> For non-partitioned columns, if min-max information is stored in the Iceberg
> metadata, could MIN(column) and MAX(column) queries also be optimized based
> on this idea?
> However, Impala does not guarantee that the statistics for these
> non-partitioned columns are complete, which complicates this.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)