[jira] [Resolved] (IMPALA-11721) Impala query keep being retried over frequently updated iceberg table

Jira Tue, 15 Nov 2022 01:38:07 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zoltán Borók-Nagy resolved IMPALA-11721.
----------------------------------------
    Fix Version/s: Impala 4.2.0
       Resolution: Fixed

> Impala query keep being retried over frequently updated iceberg table
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-11721
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11721
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.2.0
>
>
> Iceberg table loading can fail in local catalog mode if the table gets 
> updated frequently.
> This is what happens during table loading in local catalog mode:
> Every query starts with its own empty local catalog. Table metadata is 
> fetched in multiple requests via a MetaProvider, which is always a 
> CatalogdMetaProvider. CatalogdMetaProvider caches requests, and the cache key 
> also includes the table's catalog version.
> The Iceberg table is loaded by the following requests:
>  # CatalogdMetaProvider.loadTable()
>  # CatalogdMetaProvider.loadIcebergTable()
>  # CatalogdMetaProvider.loadIcebergApiTable() (this loads the Iceberg table 
> directly via the Iceberg API; CatalogD is not involved)
>  # CatalogdMetaProvider.loadTableColumnStatistics()
>  # CatalogdMetaProvider.loadPartitionList()
>  # CatalogdMetaProvider.loadPartitionsByRefs()
> Steps 1-4 happen during table loading; steps 5-6 happen during planning. We 
> cannot really reorder these invocations, but since CatalogdMetaProvider 
> caches them, only the very first invocations need to reach out to CatalogD 
> and check the table's catalog version. Subsequent invocations, i.e. 
> subsequent queries that use the Iceberg table, can use the cached metadata, 
> and there is no need to check the catalog version of the cached metadata, 
> since the cache key also includes the catalog version.
> I see two things that could resolve the issue:
>  # Speed up loadIcebergApiTable()
>  ** either by speeding up Iceberg table loading itself
>  ** or by making the Iceberg API table serializable, so we can fetch it from 
> CatalogD
>  # Pre-warm the cache before issuing loadIcebergApiTable()
>  ** so the CatalogdMetaProvider.load*() operations can be served from cache
> Option 1 needs contributions to the Iceberg library.
> Option 2 can be done relatively easily. We just need to pre-invoke 
> loadTableColumnStatistics() and 
> FeCatalogUtils.loadAllPartitions() (which invokes loadPartitionList() and 
> loadPartitionsByRefs()) before loadIcebergApiTable(), so that when they are 
> needed later they can be served from cache.
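
The pre-warming idea in option 2 can be illustrated with a small,
self-contained sketch. This is not Impala code: PrewarmSketch,
VersionedCache, loadTable() and planQuery() are hypothetical names invented
for the illustration, and the request strings only stand in for the
CatalogdMetaProvider.load*() calls listed in the description. The sketch
assumes only what the description states: the cache key includes the
table's catalog version, and the planning-time requests are issued before
the slow Iceberg API load.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; none of these types exist in Impala. */
public class PrewarmSketch {

  /** Cache whose key includes the table's catalog version. */
  static final class VersionedCache {
    private final Map<String, String> entries = new HashMap<>();
    // Counts cache misses, i.e. simulated round trips to CatalogD.
    int catalogdFetches = 0;

    String get(String table, long catalogVersion, String request) {
      String key = table + "@" + catalogVersion + "/" + request;
      String cached = entries.get(key);
      if (cached != null) return cached;
      catalogdFetches++;
      String value = request + " of " + table + " at version " + catalogVersion;
      entries.put(key, value);
      return value;
    }
  }

  /** Table loading with the proposed pre-warm step before the slow load. */
  static void loadTable(VersionedCache cache, String table, long version,
      Runnable slowIcebergApiLoad) {
    cache.get(table, version, "table");
    cache.get(table, version, "icebergTable");
    // Pre-warm: issue the requests that are otherwise made after the slow load.
    cache.get(table, version, "columnStats");
    cache.get(table, version, "partitionList");
    cache.get(table, version, "partitionsByRefs");
    // Stands in for loading the table directly via the Iceberg API.
    slowIcebergApiLoad.run();
  }

  /** Planning-time requests; returns how many of them missed the cache. */
  static int planQuery(VersionedCache cache, String table, long version) {
    int before = cache.catalogdFetches;
    cache.get(table, version, "partitionList");
    cache.get(table, version, "partitionsByRefs");
    return cache.catalogdFetches - before;
  }

  public static void main(String[] args) {
    VersionedCache cache = new VersionedCache();
    loadTable(cache, "db.ice_tbl", 42L, () -> { });
    System.out.println("fetches during loading: " + cache.catalogdFetches);
    System.out.println("fetches during planning: "
        + planQuery(cache, "db.ice_tbl", 42L));
  }
}
```

In this sketch, planning at the same catalog version is served entirely
from the cache. Planning at a different version would miss the cache,
because the version is part of the key.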



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
