[
https://issues.apache.org/jira/browse/IMPALA-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Becker resolved IMPALA-13609.
------------------------------------
Resolution: Implemented
> Store Iceberg snapshot id for COMPUTE STATS
> -------------------------------------------
>
> Key: IMPALA-13609
> URL: https://issues.apache.org/jira/browse/IMPALA-13609
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Daniel Becker
> Assignee: Daniel Becker
> Priority: Major
>
> Currently, when COMPUTE STATS is run from Impala, we set the
> 'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on the
> other hand, store the snapshot id for which stats were calculated. Although
> it is possible to retrieve the timestamp of a snapshot, comparing these two
> values is error-prone, e.g. in the following situation:
> * COMPUTE STATS calculation is running on Snapshot N
> * Snapshot N+1 is committed at time T
> * COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time T +
> Delta
> * Some engine writes Puffin statistics for Snapshot N+1
> After this, HMS stats will appear to be more recent even though they were
> calculated on Snapshot N, while we have Puffin stats for Snapshot N+1.
> To resolve this, COMPUTE STATS could set a new table property, e.g.
> 'impala.computeStatsSnapshotId'.
> On the other hand, COMPUTE STATS could be set to calculate stats for only a
> subset of the columns, and then a different subset in a subsequent run. The
> recency of the stats will then be different for each column. We could
> consider storing the snapshot id on a per-column basis.
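
The scenario in the description can be sketched roughly as follows. This is only an illustration, not Impala's implementation: the `Snapshot` class, both helper functions, and all ids and timestamps are hypothetical, and it assumes a linear snapshot history in which commit time orders snapshots. Only the property names 'impala.lastComputeStatsTime' and 'impala.computeStatsSnapshotId' come from the issue text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for an Iceberg snapshot."""
    snapshot_id: int
    committed_at_ms: int


def hms_newer_by_time(last_compute_stats_time_ms: int,
                      puffin_snapshot: Snapshot) -> bool:
    # Timestamp-based check: compares the value of
    # 'impala.lastComputeStatsTime' with the commit time of the snapshot
    # that the Puffin stats were written for.
    return last_compute_stats_time_ms > puffin_snapshot.committed_at_ms


def hms_newer_by_snapshot(compute_stats_snapshot_id: int,
                          puffin_snapshot_id: int,
                          snapshots: Dict[int, Snapshot]) -> bool:
    # Snapshot-based check: compares the snapshots the two sets of stats
    # were actually computed on (assumes a linear history, so commit time
    # orders the snapshots).
    hms_snap = snapshots[compute_stats_snapshot_id]
    puffin_snap = snapshots[puffin_snapshot_id]
    return hms_snap.committed_at_ms > puffin_snap.committed_at_ms


# Made-up values: T = 1000, Delta = 50.
snapshots = {
    101: Snapshot(snapshot_id=101, committed_at_ms=900),   # Snapshot N
    102: Snapshot(snapshot_id=102, committed_at_ms=1000),  # Snapshot N+1, time T
}
last_compute_stats_time_ms = 1050   # COMPUTE STATS finishes at T + Delta
compute_stats_snapshot_id = 101     # but it ran on Snapshot N
puffin_snapshot_id = 102            # Puffin stats written for Snapshot N+1

by_time = hms_newer_by_time(last_compute_stats_time_ms,
                            snapshots[puffin_snapshot_id])
by_snapshot = hms_newer_by_snapshot(compute_stats_snapshot_id,
                                    puffin_snapshot_id, snapshots)
print(by_time, by_snapshot)
```

With these values the timestamp-based check reports the HMS stats as newer, while the snapshot-based check does not, which is the discrepancy the new table property is meant to avoid.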