[ 
https://issues.apache.org/jira/browse/IMPALA-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608440#comment-17608440
 ] 

ASF subversion and git services commented on IMPALA-11583:
----------------------------------------------------------

Commit 3f382b7ebbd66a5a02270e14ff493bd9607c0b94 in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3f382b7eb ]

IMPALA-11583: Use Iceberg API to update stats

Before this patch we used the HMS API alter_table() to update an
Iceberg table's statistics. 'alter_table()' API calls are unsafe for
Iceberg tables as they overwrite the whole HMS table, including the
table property 'metadata_location', which must always point to the
latest snapshot. Hence a concurrent modification to the same table
could be reverted by COMPUTE STATS.
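
To make the race concrete, here is a minimal Python sketch of the lost
update described above. FakeMetastore is a hypothetical in-memory
stand-in, not the real HMS client; it models only the whole-object
replacement semantics of alter_table(). The file names and the
timestamp are invented for illustration.

```python
import copy


class FakeMetastore:
    """Hypothetical in-memory stand-in for HMS (not the real client API)."""

    def __init__(self):
        self.tables = {}

    def get_table(self, name):
        # Callers get their own copy, like a Thrift object fetched over RPC.
        return copy.deepcopy(self.tables[name])

    def alter_table(self, name, new_table):
        # Replaces the whole table object, including every table property.
        self.tables[name] = copy.deepcopy(new_table)


hms = FakeMetastore()
hms.tables["t"] = {"parameters": {"metadata_location": "v1.metadata.json"}}

# 1. COMPUTE STATS loads the table and starts its (long-running) work.
stale = hms.get_table("t")

# 2. Meanwhile another engine commits a new snapshot.
current = hms.get_table("t")
current["parameters"]["metadata_location"] = "v2.metadata.json"
hms.alter_table("t", current)

# 3. COMPUTE STATS writes its stale copy back with one extra property.
stale["parameters"]["impala.lastComputeStatsTime"] = "1663000000"
hms.alter_table("t", stale)

# The concurrent commit has been silently reverted.
reverted_location = hms.get_table("t")["parameters"]["metadata_location"]
print(reverted_location)
```

In this sketch the table ends up pointing at the old metadata file even
though COMPUTE STATS only intended to add a single property.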

This patch uses the Iceberg API to update Iceberg tables. Also,
table-level stats (e.g. numRows, totalSize, totalFiles) are no longer
set, because Iceberg keeps them up to date.

COMPUTE INCREMENTAL STATS without partition clause is the same as
plain COMPUTE STATS for Iceberg tables. This behavior is aligned
with current behavior on non-partitioned tables:
https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html

COMPUTE INCREMENTAL STATS .. PARTITION raises an error.

DROP STATS has also been modified to not drop table-level stats for
HMS-integrated Iceberg tables.

Testing:
 * added e2e tests for COMPUTE STATS
 * added e2e tests for DROP STATS
 * manually tested concurrent Hive INSERT and Impala COMPUTE STATS
   using latest Hive
 * opened IMPALA-11590 to add automated interop tests with Hive

Change-Id: I46b6e0a5a65e18e5aaf2a007ec0242b28e0fed92
Reviewed-on: http://gerrit.cloudera.org:8080/18995
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Use Iceberg APIs to update table properties for Iceberg tables
> --------------------------------------------------------------
>
>                 Key: IMPALA-11583
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11583
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> COMPUTE STATS updates table-level stats via alter_table() HMS API. This 
> replaces the whole HMS table, therefore if there are concurrent modifications 
> by another engine, e.g. Hive, it's possible that these modifications are lost.
> This is critical for Iceberg tables, as the 'metadata_location' table 
> property must always point to the latest snapshot. Inadvertently rewriting it 
> during COMPUTE STATS can result in data loss.
> Table-level stats like 'numRows' and 'totalSize' are already updated by 
> Iceberg during table modifications, i.e. there is no need to update these 
> values for COMPUTE STATS.
> Column stats are not affected as they are updated via a different API call 
> ([updateTableColumnStatistics|https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L1638()]),
>  and it doesn't touch the table properties. But updating statistics also 
> requires us to update the table property "impala.lastComputeStatsTime". We 
> should update it via Iceberg APIs when HiveCatalog is used:
> https://github.com/apache/impala/blob/4e813b7085c995a7244ef886b00c22e9d93cc80c/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java#L211
> For catalogs other than HiveCatalog we still need to update the table 
> property via the HMS API. This should be safe, as other catalogs don't depend 
> on HMS table properties.
> Reloading the HMS table before invoking 'alter_table()' can be considered in 
> other cases (non-Iceberg tables as well), to decrease the possibility of 
> losing concurrent table updates.
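
The reload-before-alter idea in the last paragraph could look roughly
like the following Python sketch. FakeMetastore and alter_with_reload
are hypothetical names used only for illustration; they are not Impala
or HMS APIs, and the property values are invented.

```python
import copy


class FakeMetastore:
    """Hypothetical in-memory stand-in for HMS (not the real client API)."""

    def __init__(self):
        self.tables = {}

    def get_table(self, name):
        return copy.deepcopy(self.tables[name])

    def alter_table(self, name, new_table):
        # Whole-object replacement, as with the real alter_table() call.
        self.tables[name] = copy.deepcopy(new_table)


def alter_with_reload(hms, name, property_changes):
    """Reload the table just before writing and apply only the changed keys."""
    fresh = hms.get_table(name)
    fresh["parameters"].update(property_changes)
    hms.alter_table(name, fresh)


hms = FakeMetastore()
hms.tables["t"] = {"parameters": {"metadata_location": "v1.metadata.json"}}

# A concurrent writer commits while stats are still being computed.
current = hms.get_table("t")
current["parameters"]["metadata_location"] = "v2.metadata.json"
hms.alter_table("t", current)

# The stats update reloads first, so it builds on the latest table state.
alter_with_reload(hms, "t", {"impala.lastComputeStatsTime": "1663000000"})

final = hms.get_table("t")["parameters"]
print(final)
```

This only narrows the window: a commit that lands between the reload
and the alter_table() call inside alter_with_reload would still be
overwritten, which is why the description speaks of decreasing the
possibility of lost updates rather than eliminating it.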



