Zoltán Borók-Nagy created IMPALA-11583:
------------------------------------------
Summary: Use Iceberg APIs to update table properties for Iceberg tables
Key: IMPALA-11583
URL: https://issues.apache.org/jira/browse/IMPALA-11583
Project: IMPALA
Issue Type: Bug
Components: Catalog
Reporter: Zoltán Borók-Nagy
COMPUTE STATS updates table-level stats via the alter_table() HMS API. This
replaces the whole HMS table, so concurrent modifications made by another
engine, e.g. Hive, can be lost.
This is critical for Iceberg tables, as the 'metadata_location' table property
must always point to the latest snapshot. Inadvertently rewriting it during
COMPUTE STATS can result in data loss.
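The interleaving described above can be illustrated with a small, self-contained simulation. This is only a sketch: the map stands in for the HMS table, alterTable() is a hypothetical helper that mimics the replace-everything behaviour of alter_table(), and the file names and row count are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the metastore: one table, properties only.
class LostUpdateDemo {
    // Simulated HMS table properties (not a real HMS client).
    static Map<String, String> hmsTable = new HashMap<>();

    // Mimics alter_table(): the stored table is replaced wholesale.
    static void alterTable(Map<String, String> newTable) {
        hmsTable = new HashMap<>(newTable);
    }

    // Runs the interleaving and returns the final metadata_location.
    static String runScenario() {
        hmsTable = new HashMap<>();
        hmsTable.put("metadata_location", "v1.metadata.json");

        // COMPUTE STATS loads its copy of the table.
        Map<String, String> staleCopy = new HashMap<>(hmsTable);

        // Another engine commits a new snapshot in the meantime.
        hmsTable.put("metadata_location", "v2.metadata.json");

        // COMPUTE STATS adds stats to its stale copy and replaces the table.
        staleCopy.put("numRows", "1000");
        alterTable(staleCopy);

        return hmsTable.get("metadata_location");
    }

    public static void main(String[] args) {
        // Prints the stale location: the other engine's commit was lost.
        System.out.println(runScenario());
    }
}
```

In this run the property ends up pointing at v1.metadata.json again, even though COMPUTE STATS never intended to touch it.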
Table-level stats like 'numRows' and 'totalSize' are already updated by Iceberg
during table modifications, so there is no need to update these values during
COMPUTE STATS.
Column stats are not affected, as they are updated via a different API call
(updateTableColumnStatistics()) that doesn't touch the table properties. But
updating statistics also requires us to update the table property
"impala.lastComputeStatsTime". We should update it via Iceberg APIs when
HiveCatalog is used:
https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java#L211
For catalogs other than HiveCatalog we still need to update the table property
via the HMS API. This should be safe, as other catalogs don't depend on HMS table
properties.
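To make the intended behaviour concrete, here is a minimal sketch of a property-level update that is committed against a known base version and retried on conflict. It is a single-threaded simulation, not Impala or Iceberg code: the class, the version counter, commit(), setProperty() and the timestamp value are all hypothetical. With the Iceberg Java API the corresponding call would presumably be table.updateProperties().set(key, value).commit(), which should be checked against the code linked above.

```java
import java.util.HashMap;
import java.util.Map;

// Single-threaded simulation; all names here are hypothetical.
class PropertyCommitDemo {
    static int version = 0;
    static Map<String, String> props = new HashMap<>();

    // Compare-and-swap commit: rejected if the table changed since baseVersion.
    static boolean commit(int baseVersion, Map<String, String> updated) {
        if (baseVersion != version) return false;
        props = new HashMap<>(updated);
        version++;
        return true;
    }

    // Sets a single property on top of the current state, retrying on conflict.
    static void setProperty(String key, String value) {
        boolean committed = false;
        while (!committed) {
            int base = version;
            Map<String, String> copy = new HashMap<>(props);
            copy.put(key, value);
            committed = commit(base, copy);
        }
    }

    static Map<String, String> runScenario() {
        version = 0;
        props = new HashMap<>();
        props.put("metadata_location", "v1.metadata.json");

        // COMPUTE STATS reads its base state.
        int staleBase = version;
        Map<String, String> staleCopy = new HashMap<>(props);

        // Another engine commits a new snapshot.
        setProperty("metadata_location", "v2.metadata.json");

        // The stale attempt is rejected instead of overwriting the table.
        staleCopy.put("impala.lastComputeStatsTime", "1663000000");
        boolean staleAccepted = commit(staleBase, staleCopy);
        if (staleAccepted) throw new IllegalStateException("stale commit accepted");

        // Retrying on top of fresh state keeps the other engine's change.
        setProperty("impala.lastComputeStatsTime", "1663000000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(runScenario().get("metadata_location"));
    }
}
```

The point of the sketch is that the stale write fails and is reapplied on fresh state, so 'metadata_location' keeps the value written by the other engine while "impala.lastComputeStatsTime" is still recorded.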
Reloading the HMS table before invoking 'alter_table()' can also be considered
in other cases (including non-Iceberg tables) to reduce the chance of losing
concurrent table updates.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)