Daniel Becker has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/21605 )
Change subject: IMPALA-13247: Support Reading Puffin files for the current snapshot
......................................................................

IMPALA-13247: Support Reading Puffin files for the current snapshot

This change adds support for reading NDV statistics from Puffin files
when they are available for the current snapshot. Puffin files or blobs
that were written for snapshots other than the current one are ignored.

The NDV values read from Puffin files take precedence over NDV values
stored in the HMS. This is because we only read Puffin stats for the
current snapshot, so these values are always up-to-date, while the
values in the HMS may be stale.

Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, this patch also introduces two ways of disabling the
reading of Puffin stats:
  - globally, with the "--disable_reading_puffin_stats" startup flag:
    when it is set to true, Impala will never read Puffin stats
  - for specific tables, by setting the
    "impala.iceberg_disable_reading_puffin_stats" table property to
    true.

Note that this change is only about reading Puffin files; Impala does
not yet support writing them.

Testing:
  - Created the PuffinDataGenerator tool, which can generate Puffin
    files and metadata.json files for different scenarios (e.g. all
    stats are in the same Puffin file; stats for different columns are
    in different Puffin files; some Puffin files are corrupt, etc.).
    The generated files are under the "testdata/ice_puffin/generated"
    directory.
  - The new test class
    'test_iceberg.py::TestIcebergTableWithPuffinStats' uses the
    generated data to test various scenarios.
  - Added custom cluster tests in
    test_iceberg_with_puffin_stats_startup_flag.py that test the
    'disable_reading_puffin_stats' startup flag.
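
The precedence and opt-out rules described in the commit message can be
sketched roughly as follows. This is only an illustrative model in
Python, not the code in PuffinStatsLoader.java: the function name, its
parameters, and the case-insensitive parsing of the table property value
are all assumptions made for the example. Only the flag name and the
table property name come from the commit message.

```python
from typing import Mapping, Optional

# Table property named in the commit message.
DISABLE_PROP = "impala.iceberg_disable_reading_puffin_stats"


def effective_ndv(
    hms_ndv: Optional[int],
    puffin_ndv: Optional[int],
    puffin_snapshot_id: Optional[int],
    current_snapshot_id: int,
    table_props: Mapping[str, str],
    disable_reading_puffin_stats: bool = False,
) -> Optional[int]:
    """Hypothetical helper: choose the NDV to use for one column."""
    # Global opt-out (models the --disable_reading_puffin_stats flag).
    if disable_reading_puffin_stats:
        return hms_ndv
    # Per-table opt-out; case-insensitive "true" parsing is an assumption.
    if table_props.get(DISABLE_PROP, "false").strip().lower() == "true":
        return hms_ndv
    # Blobs written for a snapshot other than the current one are ignored.
    if puffin_ndv is None or puffin_snapshot_id != current_snapshot_id:
        return hms_ndv
    # Puffin stats for the current snapshot take precedence over the HMS.
    return puffin_ndv
```

With this model, a Puffin NDV for the current snapshot replaces the HMS
value, while a stale blob or either opt-out leaves the HMS value in
effect.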

Change-Id: I50c1228988960a686d08a9b2942e01e366678866
---
M be/src/common/global-flags.cc
M be/src/util/backend-gflag-util.cc
M bin/impala-config.sh
M common/thrift/BackendGflags.thrift
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
A fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M java/pom.xml
A java/puffin-data-generator/00002-3c6b1ffe-ba85-4be4-a590-7c39428931e1.metadata.json
A java/puffin-data-generator/pom.xml
A java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java
A testdata/data/iceberg_test/iceberg_with_puffin_stats/data/0747babcda9277bf-954aff1b00000000_1684663509_data.0.parq
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/11cd04ec-55ea-40aa-a89b-197c3c275e7a-m0.avro
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/20240906_085606_00006_wsfgs-4d9242d5-bd79-4069-be8b-2cfced8e0647.stats
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/snap-1880359224532128423-1-11cd04ec-55ea-40aa-a89b-197c3c275e7a.avro
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v1.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v2.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/v3.metadata.json
A testdata/data/iceberg_test/iceberg_with_puffin_stats/metadata/version-hint.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/ice_puffin/00001-8661be83-1fa1-4323-9d7c-fb33cfb17e71.metadata.json
A testdata/ice_puffin/README
A testdata/ice_puffin/generated/all_files_corrupt.metadata.json
A testdata/ice_puffin/generated/all_stats.stats
A testdata/ice_puffin/generated/all_stats_in_1_file.metadata.json
A testdata/ice_puffin/generated/corrupt_file.stats
A testdata/ice_puffin/generated/corrupt_file1.stats
A testdata/ice_puffin/generated/corrupt_file2.stats
A testdata/ice_puffin/generated/current_snapshot_id.stats
A testdata/ice_puffin/generated/duplicate_stats_in_1_file.metadata.json
A testdata/ice_puffin/generated/duplicate_stats_in_1_file.stats
A testdata/ice_puffin/generated/duplicate_stats_in_2_files.metadata.json
A testdata/ice_puffin/generated/duplicate_stats_in_2_files1.stats
A testdata/ice_puffin/generated/duplicate_stats_in_2_files2.stats
A testdata/ice_puffin/generated/existing_file.stats
A testdata/ice_puffin/generated/file_contains_invalid_field_id.metadata.json
A testdata/ice_puffin/generated/file_contains_invalid_field_id.stats
A testdata/ice_puffin/generated/missing_file.metadata.json
A testdata/ice_puffin/generated/non_corrupt_file.stats
A testdata/ice_puffin/generated/not_all_blobs_current.metadata.json
A testdata/ice_puffin/generated/not_all_blobs_current.stats
A testdata/ice_puffin/generated/not_current_snapshot_id.stats
A testdata/ice_puffin/generated/one_file_corrupt_one_not.metadata.json
A testdata/ice_puffin/generated/one_file_current_one_not.metadata.json
A testdata/ice_puffin/generated/stats_divided.metadata.json
A testdata/ice_puffin/generated/stats_divided1.stats
A testdata/ice_puffin/generated/stats_divided2.stats
A testdata/ice_puffin/snap-1072310337995780425-1-f40f834f-df8a-4220-8377-2d2bd7be1150.avro
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_iceberg_with_puffin_stats_startup_flag.py
M tests/metadata/test_ddl_base.py
M tests/query_test/test_ext_data_sources.py
M tests/query_test/test_iceberg.py
55 files changed, 3,759 insertions(+), 42 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/05/21605/19
--
To view, visit http://gerrit.cloudera.org:8080/21605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Gerrit-Change-Number: 21605
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>