[
https://issues.apache.org/jira/browse/IMPALA-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy resolved IMPALA-11721.
----------------------------------------
Fix Version/s: Impala 4.2.0
Resolution: Fixed
> Impala query keep being retried over frequently updated iceberg table
> ---------------------------------------------------------------------
>
> Key: IMPALA-11721
> URL: https://issues.apache.org/jira/browse/IMPALA-11721
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg
> Fix For: Impala 4.2.0
>
>
> Iceberg table loading can fail in local catalog mode if the table gets
> updated frequently.
> This is what happens during table loading in local catalog mode:
> Every query starts with its own empty local catalog. Table metadata is
> fetched in multiple requests via a MetaProvider, which is always a
> CatalogdMetaProvider. CatalogdMetaProvider caches requests, and the cache key
> also includes the table's catalog version.
> The Iceberg table is loaded by the following requests:
> # CatalogdMetaProvider.loadTable()
> # CatalogdMetaProvider.loadIcebergTable()
> # CatalogdMetaProvider.loadIcebergApiTable() # This actually directly loads
> the Iceberg table via Iceberg API (no CatalogD involved)
> # CatalogdMetaProvider.loadTableColumnStatistics()
> # CatalogdMetaProvider.loadPartitionList()
> # CatalogdMetaProvider.loadPartitionsByRefs()
> Steps 1-4 happen during table loading; steps 5-6 happen during planning. We
> cannot really reorder these invocations, but since CatalogdMetaProvider
> caches them, only the very first invocations need to reach out to CatalogD
> and check the table's catalog version. Subsequent invocations, i.e.
> subsequent queries that use the Iceberg table, can use the cached metadata,
> and there is no need to check the catalog version of the cached metadata
> since the cache key also includes the catalog version.
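
To illustrate why a version-keyed cache needs no separate version check, here is a minimal sketch. It is not Impala's actual code: the class VersionedCacheSketch, its load() method, and the request names are hypothetical stand-ins for the caching behaviour described above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a metadata cache keyed by catalog version. */
final class VersionedCacheSketch {
  private final Map<List<Object>, Object> cache = new HashMap<>();

  /** Counts how often the (simulated) remote catalog had to be contacted. */
  int remoteFetches = 0;

  /**
   * Returns the cached value for (table, catalogVersion, request), invoking
   * the fetch function only on a cache miss.
   */
  Object load(String table, long catalogVersion, String request,
      Supplier<Object> fetch) {
    // The catalog version is part of the key, so an entry written for an
    // older version can never be returned for a newer one.
    List<Object> key = Arrays.<Object>asList(table, catalogVersion, request);
    return cache.computeIfAbsent(key, k -> {
      remoteFetches++;
      return fetch.get();
    });
  }
}
```

In this sketch a second request for the same table, version, and request type is a cache hit, while a request for a newer version always misses and goes to the remote side.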
> I see two things that could resolve the issue:
> # Speed up loadIcebergApiTable()
> ** either by speeding up Iceberg table loading itself
> ** or by making the Iceberg API table serializable, so we can fetch it from
> CatalogD
> # Pre-warm the cache before issuing loadIcebergApiTable()
> ** so the CatalogdMetaProvider.load*() operations can be served from cache
> Option 1 needs contributions to the Iceberg library.
> Option 2 can be done relatively easily. We just need to pre-invoke
> loadTableColumnStatistics() and
> FeCatalogUtils.loadAllPartitions() (which invokes loadPartitionList() and
> loadPartitionsByRefs()) before loadIcebergApiTable(). So when they are needed
> later they can be served from cache.
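
The pre-warming idea can be sketched with a toy model. Everything here is an assumption made for illustration: FakeCatalogd, CachingProvider, and the rule that a cache miss fails when the catalog version has moved on are hypothetical, and the request names are only labels borrowed from the list above. The sketch does not reproduce the actual Impala change.

```java
import java.util.HashMap;
import java.util.Map;

final class PrewarmSketch {
  /** Hypothetical catalog service; the table version grows on every update. */
  static final class FakeCatalogd {
    long currentVersion = 1;

    /** Assumed behaviour: a request for a stale version is rejected. */
    String fetch(String request, long expectedVersion) {
      if (expectedVersion != currentVersion) {
        throw new IllegalStateException("inconsistent metadata: " + request);
      }
      return request + "@v" + expectedVersion;
    }
  }

  /** Hypothetical caching provider; the cache key includes the version. */
  static final class CachingProvider {
    private final FakeCatalogd catalogd;
    private final Map<String, String> cache = new HashMap<>();

    CachingProvider(FakeCatalogd catalogd) { this.catalogd = catalogd; }

    String load(String request, long version) {
      // Only a cache miss reaches the catalog service.
      return cache.computeIfAbsent(request + "#" + version,
          k -> catalogd.fetch(request, version));
    }
  }

  /** Returns true if all requests succeed, false if a retry would be needed. */
  static boolean loadTable(FakeCatalogd catalogd, boolean prewarm) {
    CachingProvider provider = new CachingProvider(catalogd);
    long version = catalogd.currentVersion;  // version seen by the first request
    provider.load("loadTable", version);
    provider.load("loadIcebergTable", version);
    if (prewarm) {
      // Pre-invoke the later requests so their results are already cached.
      provider.load("loadTableColumnStatistics", version);
      provider.load("loadPartitionList", version);
      provider.load("loadPartitionsByRefs", version);
    }
    // Simulates the slow direct Iceberg load, during which the table is updated.
    catalogd.currentVersion++;
    try {
      provider.load("loadTableColumnStatistics", version);
      provider.load("loadPartitionList", version);
      provider.load("loadPartitionsByRefs", version);
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }
}
```

Under these assumptions, loadTable(catalogd, false) fails because the late requests miss the cache after the version has changed, while loadTable(catalogd, true) succeeds because the same requests are served from cache.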