[
https://issues.apache.org/jira/browse/IMPALA-11577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boglarka Egyed updated IMPALA-11577:
------------------------------------
Priority: Minor (was: Major)
> Optimize getting stored file types for Iceberg tables
> -----------------------------------------------------
>
> Key: IMPALA-11577
> URL: https://issues.apache.org/jira/browse/IMPALA-11577
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Gergely Fürnstáhl
> Priority: Minor
> Labels: impala-iceberg
>
> Spawned from IMPALA-10610
> Impala supports mixed file formats for Iceberg tables, which means every file
> can have a different file format, and it uses the set of existing file
> formats for planning purposes. Currently Impala goes through every file's
> metadata to aggregate this information, which can be slow if there are lots
> of data files.
> We could optimize this by storing this aggregated information somewhere
> (e.g. in Iceberg - yet to be implemented -
> [https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java])
> Update:
> IcebergContentFileStore supports aggregating file formats, similar to the
> proposed change in Iceberg, but this might not be the correct approach. It
> represents the state of the current snapshot, but IcebergScanNode receives a
> possibly time-travelled/pruned version of the file descriptor list.
> Further optimization ideas:
> * Use the aggregated info from the Iceberg
> SnapshotSummary/IcebergContentFileStore if possible (e.g. current snapshot,
> no pruning)
> * Exit the loop once we have found all the available file formats (might be
> too costly if we check the condition in every iteration)
> * Parallelize the aggregation
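
The last two ideas above can be sketched roughly as follows. This is only an illustration, not Impala's actual code: FileFormatAggregation, FileDesc and the three-value FileFormat enum are hypothetical stand-ins for the real frontend classes. The early-exit variant re-checks the exit condition only when a new format is actually added to the set, which is one possible way to keep the per-iteration cost low. The parallel variant has no early exit and simply merges per-thread sets.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration; not Impala's real classes.
final class FileFormatAggregation {
  enum FileFormat { PARQUET, ORC, AVRO }

  static final class FileDesc {
    final String path;
    final FileFormat format;

    FileDesc(String path, FileFormat format) {
      this.path = path;
      this.format = format;
    }
  }

  private static final int NUM_FORMATS = FileFormat.values().length;

  // Sequential aggregation that stops once every known format has been seen.
  static Set<FileFormat> collectWithEarlyExit(List<FileDesc> files) {
    EnumSet<FileFormat> seen = EnumSet.noneOf(FileFormat.class);
    for (FileDesc fd : files) {
      // add() returns true only when the set grew, so the size comparison
      // runs at most NUM_FORMATS times rather than once per file.
      if (seen.add(fd.format) && seen.size() == NUM_FORMATS) {
        break;
      }
    }
    return seen;
  }

  // Parallel aggregation: each worker fills its own EnumSet and the
  // partial sets are merged at the end. No early exit in this variant.
  static Set<FileFormat> collectInParallel(List<FileDesc> files) {
    return files.parallelStream()
        .map(fd -> fd.format)
        .collect(Collectors.toCollection(() -> EnumSet.noneOf(FileFormat.class)));
  }
}
```

Either method would be called with the (possibly time-travelled or pruned) file descriptor list that the scan node receives, so the result reflects the files actually scanned rather than the current snapshot.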