Hello Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/21608

to look at the new patch set (#23).

Change subject: IMPALA-13254: Optimize REFRESH for Iceberg tables
......................................................................

IMPALA-13254: Optimize REFRESH for Iceberg tables

Because the files referenced by Iceberg ContentFiles are immutable,
this change simplifies the file metadata loading logic. The optimized
process is as follows:

1. For existing ContentFiles, directly reuse the existing file
    descriptors.
2. For newly added ContentFiles that do not support block locations,
    directly create file descriptors.
3. For newly added ContentFiles that support block locations,
    choose between using a listLocatedStatus operation or calling
    getFileBlockLocations one by one, based on the number of files.
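
As a rough illustration (not the code in this patch), the three cases
above could be sketched in Java as follows. The class, enum, and method
names are hypothetical, and the sketch assumes the bulk listing is
chosen once the number of new files exceeds the threshold.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from the Impala code base.
class IcebergRefreshSketch {
  enum Strategy {
    REUSE_EXISTING,            // case 1: file already known, keep its descriptor
    CREATE_WITHOUT_LOCATIONS,  // case 2: file system has no block locations
    PER_FILE_LOOKUP,           // case 3a: few new files, look each one up
    BULK_LISTING               // case 3b: many new files, one recursive listing
  }

  // Picks the loading strategy for newly added files.
  // Assumption: bulk listing wins once newFileCount exceeds the threshold.
  static Strategy strategyForNewFiles(
      int newFileCount, boolean supportsBlockLocations, int threshold) {
    if (!supportsBlockLocations) return Strategy.CREATE_WITHOUT_LOCATIONS;
    return newFileCount > threshold
        ? Strategy.BULK_LISTING
        : Strategy.PER_FILE_LOOKUP;
  }

  // Maps every path in the current snapshot to the strategy used for it.
  static Map<String, Strategy> plan(Set<String> knownPaths,
      List<String> snapshotPaths, boolean supportsBlockLocations,
      int threshold) {
    Map<String, Strategy> result = new LinkedHashMap<>();
    List<String> added = new ArrayList<>();
    for (String path : snapshotPaths) {
      if (knownPaths.contains(path)) {
        // Immutable files: the old descriptor is still valid.
        result.put(path, Strategy.REUSE_EXISTING);
      } else {
        added.add(path);
      }
    }
    // All new files share one strategy, chosen from their total count.
    Strategy s =
        strategyForNewFiles(added.size(), supportsBlockLocations, threshold);
    for (String path : added) result.put(path, s);
    return result;
  }
}
```

The strategy for new files is chosen once per reload rather than per
file, since the trade-off depends on how many files were added in
total.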

A simple performance comparison test has been conducted in a
single-node environment. The test used the following data tables:
- non_partitioned_table: No partitions, containing 10,000 files
- partitioned_table_1: Contains 10,000 partitions, each with 1 file
- partitioned_table_2: Contains 300 partitions, each with 300 files

The following scenarios were tested:
- FULL: Perform REFRESH after executing INVALIDATE METADATA
- ADD_1_FILES: Insert 1 file using Hive and then perform REFRESH
- ADD_101_FILES: Insert 101 files using Hive and then perform REFRESH

The test results of the new version are as follows:
+-----------------------+-----------+-------------+---------------+
| Table                 | FULL      | ADD_1_FILES | ADD_101_FILES |
+-----------------------+-----------+-------------+---------------+
| non_partitioned_table | 356.389ms | 40.015ms    | 302.435ms     |
| partitioned_table_1   | 288.798ms | 26.667ms    | 33.035ms      |
| partitioned_table_2   | 1s436ms   | 237.057ms   | 225.749ms     |
+-----------------------+-----------+-------------+---------------+

The test results of the old version are as follows:
+-----------------------+-----------+-------------+---------------+
| Table                 | FULL      | ADD_1_FILES | ADD_101_FILES |
+-----------------------+-----------+-------------+---------------+
| non_partitioned_table | 338ms     | 57.156ms    | 12s903ms      |
| partitioned_table_1   | 281ms     | 40.525ms    | 12s743ms      |
| partitioned_table_2   | 1s397ms   | 336.965ms   | 1m57s         |
+-----------------------+-----------+-------------+---------------+

When the number of newly added files exceeds
iceberg_reload_new_files_threshold, REFRESH is significantly faster
(for example, ADD_101_FILES on partitioned_table_2 drops from 1m57s to
225.749ms), while the other scenarios show no noticeable change.

Change-Id: I8c99a28eb16275efdff52e0ea2711c0c6036719
---
M be/src/common/global-flags.cc
M fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
3 files changed, 143 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/21608/23
--
To view, visit http://gerrit.cloudera.org:8080/21608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c99a28eb16275efdff52e0ea2711c0c6036719
Gerrit-Change-Number: 21608
Gerrit-PatchSet: 23
Gerrit-Owner: Fu Lili <[email protected]>
Gerrit-Reviewer: Fu Lili <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
