Daniel Becker has uploaded this change for review. ( http://gerrit.cloudera.org:8080/22339 )


Change subject: IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS
......................................................................

IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS

Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated.  Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time
   T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent even though they
were calculated on snapshot N, while we have Puffin stats for snapshot
N+1.

To resolve this, COMPUTE STATS now sets a new table property,
'impala.computeStatsSnapshotIds'. This property stores the snapshot id
for which stats have been computed, for each column. It is a
comma-separated list of values of the form "fieldId:snapshotId".
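
As an illustration of the format only, a value such as "1:100,2:200"
could be parsed and written back as sketched below. The class and
method names are hypothetical and are not taken from this change; the
sketch assumes field ids fit in an int and snapshot ids in a long.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not the code added by this change.
final class SnapshotIdsProperty {
  private SnapshotIdsProperty() {}

  // Parses a comma-separated list of "fieldId:snapshotId" entries.
  // A null or blank value yields an empty map.
  static Map<Integer, Long> parse(String value) {
    Map<Integer, Long> result = new LinkedHashMap<>();
    if (value == null || value.trim().isEmpty()) return result;
    for (String entry : value.split(",")) {
      String[] parts = entry.trim().split(":");
      if (parts.length != 2) {
        throw new IllegalArgumentException("Malformed entry: " + entry);
      }
      result.put(Integer.parseInt(parts[0]), Long.parseLong(parts[1]));
    }
    return result;
  }

  // Writes the map back in the same "fieldId:snapshotId" list form,
  // preserving the iteration order of the map.
  static String serialize(Map<Integer, Long> ids) {
    StringJoiner joiner = new StringJoiner(",");
    for (Map.Entry<Integer, Long> e : ids.entrySet()) {
      joiner.add(e.getKey() + ":" + e.getValue());
    }
    return joiner.toString();
  }
}
```

Because both halves of each entry are numeric, the parser needs no
quoting or escaping rules.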

Storing the snapshot ids on a per-column basis is needed because COMPUTE
STATS can be set to calculate stats for only a subset of the columns,
and then a different subset in a subsequent run. The recency of the
stats will then be different for each column.
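
A minimal sketch of how per-column snapshot ids could diverge across
runs, assuming the property is kept as a map from field id to snapshot
id. The names below are hypothetical and the merge logic is an
assumption, not the implementation in this change.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the code added by this change.
final class SnapshotIdsMerge {
  private SnapshotIdsMerge() {}

  // Returns a copy of 'existing' in which every field id in
  // 'computedFieldIds' points to 'snapshotId'. Columns that were not
  // part of this run keep the snapshot id of their last run.
  static Map<Integer, Long> merge(Map<Integer, Long> existing,
      Collection<Integer> computedFieldIds, long snapshotId) {
    Map<Integer, Long> merged = new LinkedHashMap<>(existing);
    for (Integer fieldId : computedFieldIds) {
      merged.put(fieldId, snapshotId);
    }
    return merged;
  }
}
```

For example, a run on fields 1 and 2 at snapshot 100 followed by a run
on fields 2 and 3 at snapshot 200 leaves field 1 at snapshot 100 and
fields 2 and 3 at snapshot 200.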

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

Tables may have many columns, so to prevent the
'impala.computeStatsSnapshotIds' table property from becoming too long,
it will only include information for 10 columns by default. This limit
can be changed for a table by setting the
'impala.computeStatsSnapshotIdsMaxSize' table property to the
appropriate value.

Testing:
 - Added tests in iceberg-compute-stats.test.

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
---
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M testdata/workloads/functional-query/queries/QueryTest/iceberg-compute-stats.test
3 files changed, 148 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/22339/3
--
To view, visit http://gerrit.cloudera.org:8080/22339
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Gerrit-Change-Number: 22339
Gerrit-PatchSet: 3
Gerrit-Owner: Daniel Becker <[email protected]>
