This is an automated email from the ASF dual-hosted git repository.
voonhous pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new fb08a156b6f6 perf(metadata): Resolve column-stats field schemas once per collection instead of per record (#19000)
fb08a156b6f6 is described below
commit fb08a156b6f62324864759669d1c88c0e3faf282
Author: voonhous <[email protected]>
AuthorDate: Wed Jun 17 02:37:44 2026 +0800
perf(metadata): Resolve column-stats field schemas once per collection
instead of per record (#19000)
* perf(metadata): Resolve column-stats field schemas once per collection
instead of per record
collectColumnRangeMetadata iterated the target columns inside the per-record
loop and, for every record and every target field, recomputed values that
depend only on the fixed target field list:
- field.schema().getNonNullType() rebuilds the union-member wrappers for
  nullable fields (a fresh HoodieSchema per call), and
- because that non-null schema was a fresh instance per record, its
  toAvroSchema() (used for min/max compares) never memoized.
Resolve the non-null HoodieSchema once per target field before the loop and
iterate that precomputed list; holding a stable instance also lets
toAvroSchema() memoize across records. Per-record work (type-support check,
value extraction, min/max/null/value counts) is unchanged, so results are
identical. Covered by the existing column-stats functional tests.
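
The hoisting described above can be illustrated with a minimal, self-contained sketch. It is not the Hudi code: HoistingSketch, resolveNonNull, sampleFields and the string-based "schemas" are hypothetical stand-ins, with resolveNonNull playing the role of field.schema().getNonNullType() and a counter showing how often the resolution runs in each variant.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hudi.
class HoistingSketch {
  // Counts invocations of the (pretend) expensive schema resolution.
  static int resolveCalls = 0;

  // Stand-in for field.schema().getNonNullType(): strips a trailing '?'.
  static String resolveNonNull(String nullableType) {
    resolveCalls++;
    return nullableType.endsWith("?")
        ? nullableType.substring(0, nullableType.length() - 1)
        : nullableType;
  }

  // Two nullable target fields, as (field name, nullable type) pairs.
  static List<Map.Entry<String, String>> sampleFields() {
    List<Map.Entry<String, String>> fields = new ArrayList<>();
    fields.add(new SimpleImmutableEntry<>("id", "long?"));
    fields.add(new SimpleImmutableEntry<>("name", "string?"));
    return fields;
  }

  // Before: resolution runs once per record per field.
  static int resolveCallsPerRecord(List<Map.Entry<String, String>> fields, int recordCount) {
    resolveCalls = 0;
    for (int r = 0; r < recordCount; r++) {
      for (Map.Entry<String, String> field : fields) {
        String schema = resolveNonNull(field.getValue());
        // per-record stats work would use `schema` here
      }
    }
    return resolveCalls;
  }

  // After: resolution runs once per field, before the record loop.
  static int resolveCallsHoisted(List<Map.Entry<String, String>> fields, int recordCount) {
    resolveCalls = 0;
    List<Map.Entry<String, String>> resolved = new ArrayList<>();
    for (Map.Entry<String, String> field : fields) {
      resolved.add(new SimpleImmutableEntry<>(field.getKey(), resolveNonNull(field.getValue())));
    }
    for (int r = 0; r < recordCount; r++) {
      for (Map.Entry<String, String> field : resolved) {
        String schema = field.getValue();
        // per-record stats work would use `schema` here, unchanged
      }
    }
    return resolveCalls;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> fields = sampleFields();
    System.out.println("per record: " + resolveCallsPerRecord(fields, 3)); // prints "per record: 6"
    System.out.println("hoisted: " + resolveCallsHoisted(fields, 3));      // prints "hoisted: 2"
  }
}
```

With R records and F target fields, the resolution count drops from R x F to F, while the loop body sees the same resolved value for each field either way.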
Closes #18999
* review(metadata): stream per-field schema resolution and drop
toAvroSchema memoize note
---
.../org/apache/hudi/metadata/HoodieTableMetadataUtil.java | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
index 4823f041bae6..cfdeb45b84ed 100644
--- a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
+++ b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
@@ -276,15 +276,19 @@ public class HoodieTableMetadataUtil {
final Properties properties = new Properties();
properties.setProperty(HoodieStorageConfig.WRITE_UTC_TIMEZONE.key(),
storageConfig.getString(HoodieStorageConfig.WRITE_UTC_TIMEZONE.key(),
HoodieStorageConfig.WRITE_UTC_TIMEZONE.defaultValue().toString()));
+ // getNonNullType() rebuilds the union-member wrappers for nullable fields and depends only on the
+ // (fixed) target fields, so resolve it once per field instead of once per record per field.
+ List<Pair<String, HoodieSchema>> nonNullFieldSchemas = targetFields.stream()
+ .map(p -> Pair.of(p.getKey(), p.getValue().schema().getNonNullType()))
+ .collect(Collectors.toList());
// Collect stats for all columns by iterating through records while accounting
// corresponding stats
records.forEachRemaining((record) -> {
// For each column (field) we have to index update corresponding column stats
// with the values from this record
- targetFields.forEach(fieldNameFieldPair -> {
- String fieldName = fieldNameFieldPair.getKey();
- HoodieSchemaField field = fieldNameFieldPair.getValue();
- HoodieSchema fieldSchema = field.schema().getNonNullType();
+ nonNullFieldSchemas.forEach(fieldNameSchemaPair -> {
+ String fieldName = fieldNameSchemaPair.getKey();
+ HoodieSchema fieldSchema = fieldNameSchemaPair.getValue();
if (!isColumnTypeSupported(fieldSchema, Option.of(record.getRecordType()), indexVersion)) {
return;
}