the-other-tim-brown commented on code in PR #14311:
URL: https://github.com/apache/hudi/pull/14311#discussion_r2566039328
##########
hudi-common/src/main/java/org/apache/hudi/common/util/FileFormatUtils.java:
##########
@@ -60,15 +61,22 @@
*/
public abstract class FileFormatUtils {
/**
- * Aggregate column range statistics across files in a partition.
+ * Aggregate column range statistics across files in a partition using HoodieSchema.
Review Comment:
I think it would be helpful to add the context that the HoodieSchema is used for properly extracting the stats based on the data type.
##########
hudi-common/src/main/java/org/apache/hudi/stats/ValueType.java:
##########
@@ -244,77 +243,70 @@ public static ValueType fromParquetPrimitiveType(PrimitiveType primitiveType) {
}
}
- public static ValueType fromSchema(Schema schema) {
- switch (schema.getType()) {
+ /**
+ * Infers ValueType from HoodieSchema for type inference in column statistics.
+ * Leverages specialized HoodieSchema subclasses to determine the appropriate ValueType.
+ *
+ * @param schema the HoodieSchema to infer type from
+ * @return the corresponding ValueType
+ * @throws IllegalArgumentException if the schema type is not supported
+ * @since 1.2.0
+ */
+ public static ValueType fromSchema(HoodieSchema schema) {
+ // Handle logical types first using instanceof checks on specialized classes
+ if (schema instanceof HoodieSchema.Decimal) {
Review Comment:
Instead of using `instanceof` we can just augment the switch statement below
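To make the suggestion concrete, here is a minimal, self-contained sketch of the switch-based shape; `SchemaType`, `ValueTypeSketch`, and the enum constants are stand-in names for illustration only, not the actual `HoodieSchema`/`ValueType` APIs in hudi-common:

```java
// Hypothetical sketch: fold logical-type handling (e.g. DECIMAL) into the
// single switch on the schema's type tag instead of instanceof chains.
// SchemaType stands in for the schema's type enum; ValueType for the target enum.
enum SchemaType { STRING, INT, LONG, DECIMAL, DATE }

enum ValueType { STRING, INT, LONG, DECIMAL, DATE }

class ValueTypeSketch {
  static ValueType fromSchema(SchemaType type) {
    switch (type) {
      case DECIMAL:
        return ValueType.DECIMAL; // logical type handled as an ordinary case
      case DATE:
        return ValueType.DATE;
      case STRING:
        return ValueType.STRING;
      case INT:
        return ValueType.INT;
      case LONG:
        return ValueType.LONG;
      default:
        throw new IllegalArgumentException("Unsupported schema type: " + type);
    }
  }
}
```

With this shape, adding a new logical type means adding one case label rather than another `instanceof` branch ahead of the switch.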
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestColStatsRecordWithMetadataRecord.java:
##########
@@ -451,7 +452,11 @@ private HoodieColumnRangeMetadata mergeAndAssert(HoodieColumnRangeMetadata<Compa
Map<String, Schema> colsToIndexSchemaMap = new HashMap<>();
colsToIndexSchemaMap.put(colName, Schema.create(schemaType));
- HoodieColumnRangeMetadata actualColumnRange = FileFormatUtils.getColumnRangeInPartition(relativePartitionPath, colName, fileColumnRanges, colsToIndexSchemaMap, V1);
+ // Convert Avro Schema map to HoodieSchema map
+ Map<String, HoodieSchema> hoodieSchemaMap = colsToIndexSchemaMap.entrySet().stream()
+     .collect(Collectors.toMap(Map.Entry::getKey, entry -> HoodieSchema.fromAvroSchema(entry.getValue())));
Review Comment:
Let's simplify this:
```
Map<String, HoodieSchema> hoodieSchemaMap = Collections.singletonMap(colName, HoodieSchema.create(schemaType));
```
Let's also update the method signature to take in HoodieSchemaType instead of an avro type, to help break away from the avro dependency.
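As a toy illustration of the signature change being suggested, the test helper could accept a Hudi-native type enum and build the single-entry map directly; `SketchSchemaType` and `SchemaMapSketch` are stand-ins invented here, since the real `HoodieSchemaType` enum is not reproduced:

```java
import java.util.Collections;
import java.util.Map;

// Stand-in for HoodieSchemaType; the real enum lives in hudi-common.
enum SketchSchemaType { STRING, INT, LONG }

class SchemaMapSketch {
  // Single-column schema map as the review comment suggests: keyed directly
  // by the column under test, with no Avro Schema objects and no
  // stream/collect machinery involved.
  static Map<String, SketchSchemaType> singleColumn(String colName, SketchSchemaType type) {
    return Collections.singletonMap(colName, type);
  }
}
```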
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1882,21 +1885,29 @@ public static Option<Schema> tryResolveSchemaForTable(HoodieTableMetaClient data
}
/**
- * Given a schema, coerces provided value to instance of {@link Comparable<?>} such that
- * it could subsequently be used in column stats
+ * Given a HoodieSchema, coerces provided value to instance of {@link Comparable<?>} such that
+ * it could subsequently be used in column stats. This method uses HoodieSchema for in-memory
+ * processing while maintaining compatibility with existing Avro-based serialization.
*
* NOTE: This method has to stay compatible with the semantic of
* {@link FileFormatUtils#readColumnStatsFromMetadata} as they are used in tandem
+ *
+ * @param hoodieSchema the HoodieSchema to use for type coercion
+ * @param val the value to coerce
+ * @return the coerced value as a Comparable
+ * @since 1.2.0
*/
- public static Comparable<?> coerceToComparable(Schema schema, Object val) {
+ public static Comparable<?> coerceToComparable(HoodieSchema hoodieSchema, Object val) {
Review Comment:
nitpick: let's just leave the variable name as `schema`
##########
hudi-common/src/main/java/org/apache/hudi/stats/ValueMetadata.java:
##########
@@ -234,19 +235,32 @@ public static ValueMetadata getValueMetadata(GenericRecord columnStatsRecord) {
}
}
- public static ValueMetadata getValueMetadata(Schema fieldSchema, HoodieIndexVersion indexVersion) {
+ /**
+ * Creates ValueMetadata from HoodieSchema for column statistics type inference.
+ * This method uses HoodieSchema for in-memory processing while maintaining
+ * compatibility with existing Avro-based serialization.
+ *
+ * @param fieldSchema the HoodieSchema of the field
+ * @param indexVersion the index version to determine metadata format
+ * @return ValueMetadata instance for the given schema
+ * @throws IllegalArgumentException if schema is null or has unsupported logical type
+ * @since 1.2.0
+ */
+ public static ValueMetadata getValueMetadata(HoodieSchema fieldSchema, HoodieIndexVersion indexVersion) {
if (indexVersion.lowerThan(HoodieIndexVersion.V2)) {
return V1EmptyMetadata.get();
}
if (fieldSchema == null) {
throw new IllegalArgumentException("Field schema cannot be null");
}
- Schema valueSchema = getNonNullTypeFromUnion(fieldSchema);
+ HoodieSchema valueSchema = HoodieSchemaUtils.getNonNullTypeFromUnion(fieldSchema);
ValueType valueType = ValueType.fromSchema(valueSchema);
if (valueType == ValueType.V1) {
- throw new IllegalArgumentException("Unsupported logical type for: " + valueSchema.getLogicalType());
+ Schema avroSchema = valueSchema.toAvroSchema();
+ throw new IllegalArgumentException("Unsupported logical type for: " + avroSchema.getLogicalType());
} else if (valueType == ValueType.DECIMAL) {
- return DecimalMetadata.create((LogicalTypes.Decimal) valueSchema.getLogicalType());
+ Schema avroSchema = valueSchema.toAvroSchema();
+ return DecimalMetadata.create((LogicalTypes.Decimal) avroSchema.getLogicalType());
Review Comment:
Can this be updated to operate directly on HoodieSchema?
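One possible shape for this, sketched with stand-in classes since the real `HoodieSchema.Decimal` and `DecimalMetadata` APIs are not reproduced here: if the decimal schema subclass exposes precision and scale itself, the metadata can be built without the `toAvroSchema()` round trip.

```java
// Hypothetical stand-in for HoodieSchema.Decimal carrying its own
// precision/scale, so no conversion to an Avro LogicalTypes.Decimal is needed.
class DecimalSchemaSketch {
  final int precision;
  final int scale;
  DecimalSchemaSketch(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }
}

// Hypothetical stand-in for DecimalMetadata with a factory that reads the
// decimal parameters straight off the HoodieSchema-style object.
class DecimalMetadataSketch {
  final int precision;
  final int scale;
  private DecimalMetadataSketch(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }
  static DecimalMetadataSketch create(DecimalSchemaSketch schema) {
    return new DecimalMetadataSketch(schema.precision, schema.scale);
  }
}
```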
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]