Daniel Becker has uploaded a new patch set (#30). ( 
http://gerrit.cloudera.org:8080/21605 )

Change subject: IMPALA-13247: Support Reading Puffin files for the current 
snapshot
......................................................................

IMPALA-13247: Support Reading Puffin files for the current snapshot

This change adds support for reading NDV statistics from Puffin files
when they are available for the current snapshot. Puffin files or blobs
that were written for snapshots other than the current one are ignored.
Because this behaviour is different from what we have for HMS stats and
may therefore be unintuitive for users, reading Puffin stats is disabled
by default; set the "--disable_reading_puffin_stats" startup flag to
false to enable it.
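
As an illustration of the snapshot filtering described above, here is a
minimal sketch. It is not the code in PuffinStatsLoader.java: the helper
name and the dictionary layout are hypothetical, with the "snapshot-id"
and "fields" keys borrowed from the Puffin blob metadata format.

```python
def blobs_for_current_snapshot(blobs, current_snapshot_id):
    """Keep only the blobs written for the current snapshot.

    Hypothetical helper: each blob is modelled as a plain dict, which is
    an assumption made for this sketch only.
    """
    return [b for b in blobs if b.get("snapshot-id") == current_snapshot_id]


blobs = [
    {"snapshot-id": 100, "fields": [1], "ndv": 5},
    {"snapshot-id": 99, "fields": [2], "ndv": 7},  # older snapshot: ignored
]
current = blobs_for_current_snapshot(blobs, current_snapshot_id=100)
```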

When Puffin stats reading is enabled, the NDV values read from Puffin
files take precedence over NDV values stored in the HMS. This is because
we only read Puffin stats for the current snapshot, so these values are
always up-to-date, while the values in the HMS may be stale.
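
The precedence rule can be pictured with the following sketch. It only
illustrates the idea and is not the actual implementation; the function
name and the column-name-to-NDV dictionaries are hypothetical, and it
assumes that columns without Puffin stats keep their HMS value.

```python
def effective_ndvs(hms_ndvs, puffin_ndvs):
    """Merge per-column NDVs so that Puffin values override HMS values.

    Hypothetical helper: both arguments map column names to NDV values.
    """
    merged = dict(hms_ndvs)     # start from the (possibly stale) HMS values
    merged.update(puffin_ndvs)  # Puffin values for the current snapshot win
    return merged


hms = {"id": 90, "name": 40}
puffin = {"id": 100}
result = effective_ndvs(hms, puffin)
```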

Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, this patch also introduces two ways of disabling the
reading of Puffin stats:
  - globally, with the aforementioned "--disable_reading_puffin_stats"
    startup flag: when it is set to true, Impala will never read Puffin
    stats
  - for specific tables, by setting the
    "impala.iceberg_disable_reading_puffin_stats" table property to
    true.
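
How the two switches combine can be sketched as below. This is an
assumption-laden illustration rather than the real code path: the
function name is hypothetical, and the way the table property value is
parsed (case-insensitive comparison with "true") is assumed.

```python
TABLE_PROP = "impala.iceberg_disable_reading_puffin_stats"


def should_read_puffin_stats(disable_flag, table_properties):
    """Return True only if neither the global flag nor the table opts out.

    disable_flag models "--disable_reading_puffin_stats" (true by default);
    table_properties is a dict of table property names to string values.
    """
    if disable_flag:
        return False  # the global switch disables reading for all tables
    # Assumed parsing: any value other than "true" leaves reading enabled.
    return table_properties.get(TABLE_PROP, "false").lower() != "true"
```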

Note that this change is only about reading Puffin files; Impala does
not yet support writing them.

Testing:
 - Created the PuffinDataGenerator tool, which can generate Puffin
   files and metadata.json files for different scenarios (e.g. all
   stats are in the same Puffin file; stats for different columns are
   in different Puffin files; some Puffin files are corrupt, etc.).
   The generated files are under the "testdata/ice_puffin/generated"
   directory.
 - The new custom cluster test class
   'test_iceberg_with_puffin.py::TestIcebergTableWithPuffinStats' uses
   the generated data to test various scenarios.
 - Added custom cluster tests that test the
   'disable_reading_puffin_stats' startup flag.

Change-Id: I50c1228988960a686d08a9b2942e01e366678866
---
M be/src/catalog/catalog.cc
M be/src/util/backend-gflag-util.cc
M bin/impala-config.sh
M common/thrift/BackendGflags.thrift
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
A fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M java/pom.xml
A java/puffin-data-generator/00002-3c6b1ffe-ba85-4be4-a590-7c39428931e1.metadata.json
A java/puffin-data-generator/pom.xml
A java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java
A testdata/data/iceberg_test/iceberg_with_puffin_stats/data/0747babcda9277bf-954aff1b00000000_1684663509_data.0.parq
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/11cd04ec-55ea-40aa-a89b-197c3c275e7a-m0.avro
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/20240906_085606_00006_wsfgs-4d9242d5-bd79-4069-be8b-2cfced8e0647.stats
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/snap-1880359224532128423-1-11cd04ec-55ea-40aa-a89b-197c3c275e7a.avro
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v1.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v2.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v3.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/version-hint.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/ice_puffin/00001-2e1ade02-35ae-4a8f-a84f-784d1e0c0790.metadata.json
A testdata/ice_puffin/README
A testdata/ice_puffin/generated/all_files_corrupt.metadata.json
A testdata/ice_puffin/generated/all_stats.stats
A testdata/ice_puffin/generated/all_stats_in_1_file.metadata.json
A testdata/ice_puffin/generated/corrupt_file.stats
A testdata/ice_puffin/generated/corrupt_file1.stats
A testdata/ice_puffin/generated/corrupt_file2.stats
A testdata/ice_puffin/generated/current_snapshot_id.stats
A testdata/ice_puffin/generated/duplicate_stats_in_1_file.metadata.json
A testdata/ice_puffin/generated/duplicate_stats_in_1_file.stats
A testdata/ice_puffin/generated/duplicate_stats_in_2_files.metadata.json
A testdata/ice_puffin/generated/duplicate_stats_in_2_files1.stats
A testdata/ice_puffin/generated/duplicate_stats_in_2_files2.stats
A testdata/ice_puffin/generated/existing_file.stats
A testdata/ice_puffin/generated/file_contains_invalid_field_id.metadata.json
A testdata/ice_puffin/generated/file_contains_invalid_field_id.stats
A testdata/ice_puffin/generated/missing_file.metadata.json
A testdata/ice_puffin/generated/non_corrupt_file.stats
A testdata/ice_puffin/generated/not_all_blobs_current.metadata.json
A testdata/ice_puffin/generated/not_all_blobs_current.stats
A testdata/ice_puffin/generated/not_current_snapshot_id.stats
A testdata/ice_puffin/generated/one_file_corrupt_one_not.metadata.json
A testdata/ice_puffin/generated/one_file_current_one_not.metadata.json
A testdata/ice_puffin/generated/stats_divided.metadata.json
A testdata/ice_puffin/generated/stats_divided1.stats
A testdata/ice_puffin/generated/stats_divided2.stats
A testdata/ice_puffin/generated/stats_for_unsupported_type.metadata.json
A testdata/ice_puffin/generated/stats_for_unsupported_type.stats
A testdata/ice_puffin/snap-2532372403033748214-1-c9f94c00-2920-4a39-8e7f-c7faf7e71a7d.avro
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_iceberg_with_puffin.py
M tests/metadata/test_ddl_base.py
M tests/query_test/test_ext_data_sources.py
M tests/query_test/test_iceberg.py
57 files changed, 4,138 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/05/21605/30
--
To view, visit http://gerrit.cloudera.org:8080/21605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Gerrit-Change-Number: 21605
Gerrit-PatchSet: 30
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
