Daniel Becker has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/21605 )

Change subject: IMPALA-13247: Support Reading Puffin files for the current snapshot
......................................................................

IMPALA-13247: Support Reading Puffin files for the current snapshot

This change adds support for reading NDV (number of distinct values)
statistics from Puffin files when they are available for the current
snapshot. Puffin files or blobs that were written for snapshots other
than the current one are ignored.
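
The snapshot filter described above can be illustrated with a minimal
Python sketch. It is not the code in PuffinStatsLoader.java; the
BlobMeta class, its fields and the sample values are hypothetical
stand-ins used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobMeta:
    """Hypothetical stand-in for the metadata of one Puffin blob."""
    snapshot_id: int  # snapshot the blob was written for
    field_id: int     # Iceberg field id of the column
    ndv: int          # NDV estimate carried by the blob


def blobs_for_current_snapshot(blobs, current_snapshot_id):
    """Keep only blobs written for the current snapshot; ignore the rest."""
    return [b for b in blobs if b.snapshot_id == current_snapshot_id]


# Made-up sample data: two blobs for snapshot 100, one for snapshot 99.
blobs = [BlobMeta(100, 1, 42), BlobMeta(99, 2, 7), BlobMeta(100, 3, 5)]
current = blobs_for_current_snapshot(blobs, current_snapshot_id=100)
```

In this sketch the blob for field 2 is dropped because it belongs to
an older snapshot.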

The NDV values read from Puffin files take precedence over NDV values
stored in the HMS. This is because we only read Puffin stats for the
current snapshot, so these values are always up-to-date, while the
values in the HMS may be stale.
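
A rough Python sketch of this precedence rule, assuming the NDV values
are kept as per-column dictionaries (the function name and the data
layout are illustrative assumptions, not Impala's actual data
structures):

```python
def merge_ndv(hms_ndv, puffin_ndv):
    """Combine per-column NDV values, letting Puffin values win.

    Both arguments map a column name to an NDV value. Columns without
    a Puffin value keep their HMS value.
    """
    merged = dict(hms_ndv)     # start from the possibly stale HMS values
    merged.update(puffin_ndv)  # overwrite with current-snapshot values
    return merged


# Made-up sample values: only column "b" has a Puffin NDV.
result = merge_ndv({"a": 10, "b": 20}, {"b": 25})
```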

Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, this patch also introduces two ways of disabling the
reading of Puffin stats:
  - globally, with the "--disable_reading_puffin_stats" startup flag:
    when it is set to true, Impala will never read Puffin stats
  - for specific tables, by setting the
    "impala.iceberg_disable_reading_puffin_stats" table property to
    true.
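
How the two switches combine can be sketched in Python as follows.
This is only an illustration: the function name is hypothetical, and
treating the property value as a case-insensitive "true" string is an
assumption rather than something stated in this change.

```python
TABLE_PROPERTY = "impala.iceberg_disable_reading_puffin_stats"


def should_read_puffin_stats(disable_flag, table_properties):
    """Return True if Puffin stats should be read for a table.

    disable_flag stands for the value of the
    --disable_reading_puffin_stats startup flag; table_properties is a
    dict of the table's properties.
    """
    if disable_flag:
        # The global flag disables reading for every table.
        return False
    # Assumption: an unset property means reading is not disabled.
    value = table_properties.get(TABLE_PROPERTY, "false")
    return value.strip().lower() != "true"
```

With the startup flag left at false, only tables that set the property
to true skip Puffin stats; with the flag set to true, the table
property has no effect.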

Note that this change is only about reading Puffin files; Impala does
not yet support writing them.

Testing:
 - Created the PuffinDataGenerator tool, which can generate Puffin files
   and metadata.json files for different scenarios (e.g. all stats are
   in the same Puffin file; stats for different columns are in different
   Puffin files; some Puffin files are corrupt, etc.). The generated
   files are under the "testdata/ice_puffin/generated" directory.
 - The new test class 'test_iceberg.py::TestIcebergTableWithPuffinStats'
   uses the generated data to test various scenarios.
 - Added custom cluster tests in
   test_iceberg_with_puffin_stats_startup_flag.py that test the
   'disable_reading_puffin_stats' startup flag.

Change-Id: I50c1228988960a686d08a9b2942e01e366678866
---
M be/src/common/global-flags.cc
M be/src/util/backend-gflag-util.cc
M bin/impala-config.sh
M common/thrift/BackendGflags.thrift
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
A fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M java/pom.xml
A java/puffin-data-generator/00002-3c6b1ffe-ba85-4be4-a590-7c39428931e1.metadata.json
A java/puffin-data-generator/pom.xml
A java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java
A testdata/data/iceberg_test/iceberg_with_puffin_stats/data/0747babcda9277bf-954aff1b00000000_1684663509_data.0.parq
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/11cd04ec-55ea-40aa-a89b-197c3c275e7a-m0.avro
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/20240906_085606_00006_wsfgs-4d9242d5-bd79-4069-be8b-2cfced8e0647.stats
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/snap-1880359224532128423-1-11cd04ec-55ea-40aa-a89b-197c3c275e7a.avro
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v1.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v2.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v3.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/version-hint.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/ice_puffin/00001-8661be83-1fa1-4323-9d7c-fb33cfb17e71.metadata.json
A testdata/ice_puffin/README
A testdata/ice_puffin/generated/all_files_corrupt.metadata.json
A testdata/ice_puffin/generated/all_stats.stats
A testdata/ice_puffin/generated/all_stats_in_1_file.metadata.json
A testdata/ice_puffin/generated/corrupt_file.stats
A testdata/ice_puffin/generated/corrupt_file1.stats
A testdata/ice_puffin/generated/corrupt_file2.stats
A testdata/ice_puffin/generated/current_snapshot_id.stats
A testdata/ice_puffin/generated/duplicate_stats_in_1_file.metadata.json
A testdata/ice_puffin/generated/duplicate_stats_in_1_file.stats
A testdata/ice_puffin/generated/duplicate_stats_in_2_files.metadata.json
A testdata/ice_puffin/generated/duplicate_stats_in_2_files1.stats
A testdata/ice_puffin/generated/duplicate_stats_in_2_files2.stats
A testdata/ice_puffin/generated/existing_file.stats
A testdata/ice_puffin/generated/file_contains_invalid_field_id.metadata.json
A testdata/ice_puffin/generated/file_contains_invalid_field_id.stats
A testdata/ice_puffin/generated/missing_file.metadata.json
A testdata/ice_puffin/generated/non_corrupt_file.stats
A testdata/ice_puffin/generated/not_all_blobs_current.metadata.json
A testdata/ice_puffin/generated/not_all_blobs_current.stats
A testdata/ice_puffin/generated/not_current_snapshot_id.stats
A testdata/ice_puffin/generated/one_file_corrupt_one_not.metadata.json
A testdata/ice_puffin/generated/one_file_current_one_not.metadata.json
A testdata/ice_puffin/generated/stats_divided.metadata.json
A testdata/ice_puffin/generated/stats_divided1.stats
A testdata/ice_puffin/generated/stats_divided2.stats
A testdata/ice_puffin/snap-1072310337995780425-1-f40f834f-df8a-4220-8377-2d2bd7be1150.avro
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_iceberg_with_puffin_stats_startup_flag.py
M tests/metadata/test_ddl_base.py
M tests/query_test/test_ext_data_sources.py
M tests/query_test/test_iceberg.py
55 files changed, 3,759 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/05/21605/19
--
To view, visit http://gerrit.cloudera.org:8080/21605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Gerrit-Change-Number: 21605
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
