[jira] [Updated] (IMPALA-11577) Optimize getting stored file types for Iceberg tables

Boglarka Egyed (Jira) Thu, 05 Jun 2025 04:28:05 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-11577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Boglarka Egyed updated IMPALA-11577:
------------------------------------
    Priority: Minor  (was: Major)

> Optimize getting stored file types for Iceberg tables
> -----------------------------------------------------
>
>                 Key: IMPALA-11577
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11577
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Gergely Fürnstáhl
>            Priority: Minor
>              Labels: impala-iceberg
>
> Spawned from IMPALA-10610
> Impala supports mixed file formats for Iceberg tables, which means every file 
> can have a different file format, and it uses the set of existing file formats 
> for planning purposes. Currently Impala goes through every file's metadata to 
> aggregate this information, which can be slow if there are lots of data files.
> We could optimize this by storing the aggregated information somewhere 
> (e.g. in Iceberg - yet to be implemented - 
> [https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java])
> Update:
> IcebergContentFileStore supports aggregating file formats, similar to the 
> proposed change in Iceberg, but it might not be the correct approach. It 
> represents the state of the current snapshot, but IcebergScanNode receives a 
> possibly time-travelled/pruned version of the file descriptor list. 
> Further optimization ideas:
>  * Use the aggregated info from the Iceberg 
> SnapshotSummary/IcebergContentFileStore if possible (e.g. current snapshot, 
> no pruning)
>  * Exit the loop once we have found all the available file formats (might be 
> too costly if we check the condition in every iteration)
>  * Parallelize the aggregation
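
The second and third ideas above can be sketched roughly as follows. This is only a minimal illustration, not Impala's actual code: the class FileFormatAggregator, its FileFormat enum and both method names are hypothetical stand-ins, and each data file is reduced to just its format. The sequential variant does the size comparison only when a previously unseen format is added, which is one possible way to limit the per-iteration cost mentioned above.

```java
import java.util.Collection;
import java.util.EnumSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in types for illustration; not Impala's actual classes.
final class FileFormatAggregator {
    enum FileFormat { PARQUET, ORC, AVRO }

    // Sequential scan that stops as soon as every known format has been seen.
    static Set<FileFormat> collectFormats(Collection<FileFormat> fileFormats) {
        EnumSet<FileFormat> seen = EnumSet.noneOf(FileFormat.class);
        final int allFormats = FileFormat.values().length;
        for (FileFormat format : fileFormats) {
            // add() returns true only for a format not seen before, so the
            // size comparison runs at most once per distinct format.
            if (seen.add(format) && seen.size() == allFormats) {
                break;
            }
        }
        return seen;
    }

    // Parallel variant: workers build partial sets that are merged at the end.
    // It has no early exit, so it always visits every element.
    static Set<FileFormat> collectFormatsParallel(Collection<FileFormat> fileFormats) {
        return fileFormats.parallelStream()
            .collect(Collectors.toCollection(() -> EnumSet.noneOf(FileFormat.class)));
    }
}
```

Whether either variant actually helps would depend on how many files a table has and how many distinct formats it mixes, so both would need measuring against the existing loop.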



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

