the-other-tim-brown commented on code in PR #14311:
URL: https://github.com/apache/hudi/pull/14311#discussion_r2566039328
##########
hudi-common/src/main/java/org/apache/hudi/common/util/FileFormatUtils.java:
##########
@@ -60,15 +61,22 @@
*/
public abstract class FileFormatUtils {
/**
- * Aggregate column range statistics across files in a partition.
+ * Aggregate column range statistics across files in a partition using HoodieSchema.
Review Comment:
I think it would be helpful to add the context that the HoodieSchema is used for properly extracting the stats based on the data type.
##########
hudi-common/src/main/java/org/apache/hudi/stats/ValueType.java:
##########
@@ -244,77 +243,70 @@ public static ValueType fromParquetPrimitiveType(PrimitiveType primitiveType) {
}
}
- public static ValueType fromSchema(Schema schema) {
- switch (schema.getType()) {
+ /**
+ * Infers ValueType from HoodieSchema for type inference in column statistics.
+ * Leverages specialized HoodieSchema subclasses to determine the appropriate ValueType.
+ *
+ * @param schema the HoodieSchema to infer type from
+ * @return the corresponding ValueType
+ * @throws IllegalArgumentException if the schema type is not supported
+ * @since 1.2.0
+ */
+ public static ValueType fromSchema(HoodieSchema schema) {
+ // Handle logical types first using instanceof checks on specialized classes
+ if (schema instanceof HoodieSchema.Decimal) {
Review Comment:
Instead of using `instanceof` we can just augment the switch statement below
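To make the suggestion concrete, here is a minimal, self-contained sketch of the switch-based shape; `SchemaType`, `ValueTypeSketch`, and the enum constants are stand-in names for illustration only, not the actual `HoodieSchema`/`ValueType` APIs in hudi-common:

```java
// Hypothetical sketch: fold logical-type handling (e.g. DECIMAL) into the
// single switch on the schema's type tag instead of instanceof chains.
// SchemaType stands in for the schema's type enum; ValueType for the target enum.
enum SchemaType { STRING, INT, LONG, DECIMAL, DATE }

enum ValueType { STRING, INT, LONG, DECIMAL, DATE }

class ValueTypeSketch {
  static ValueType fromSchema(SchemaType type) {
    switch (type) {
      case DECIMAL:
        return ValueType.DECIMAL; // logical type handled as an ordinary case
      case DATE:
        return ValueType.DATE;
      case STRING:
        return ValueType.STRING;
      case INT:
        return ValueType.INT;
      case LONG:
        return ValueType.LONG;
      default:
        throw new IllegalArgumentException("Unsupported schema type: " + type);
    }
  }
}
```

With this shape, adding a new logical type means adding one case label rather than another `instanceof` branch ahead of the switch.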
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestColStatsRecordWithMetadataRecord.java:
##########
@@ -451,7 +452,11 @@ private HoodieColumnRangeMetadata mergeAndAssert(HoodieColumnRangeMetadata<Compa
Map<String, Schema> colsToIndexSchemaMap = new HashMap<>();
colsToIndexSchemaMap.put(colName, Schema.create(schemaType));
- HoodieColumnRangeMetadata actualColumnRange = FileFormatUtils.getColumnRangeInPartition(relativePartitionPath, colName, fileColumnRanges, colsToIndexSchemaMap, V1);
+ // Convert Avro Schema map to HoodieSchema map
+ Map<String, HoodieSchema> hoodieSchemaMap = colsToIndexSchemaMap.entrySet().stream()
+     .collect(Collectors.toMap(Map.Entry::getKey, entry -> HoodieSchema.fromAvroSchema(entry.getValue())));
Review Comment:
Let's simplify this:
```
Map<String, HoodieSchema> hoodieSchemaMap = Collections.singletonMap(colName, HoodieSchema.create(schemaType));
```
Let's also update the method signature to take in HoodieSchemaType instead of an avro type, to help break away from the avro dependency.
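As a toy illustration of the signature change being suggested, the test helper could accept a Hudi-native type enum and build the single-entry map directly; `SketchSchemaType` and `SchemaMapSketch` are stand-ins invented here, since the real `HoodieSchemaType` enum is not reproduced:

```java
import java.util.Collections;
import java.util.Map;

// Stand-in for HoodieSchemaType; the real enum lives in hudi-common.
enum SketchSchemaType { STRING, INT, LONG }

class SchemaMapSketch {
  // Single-column schema map as the review comment suggests: keyed directly
  // by the column under test, with no Avro Schema objects and no
  // stream/collect machinery involved.
  static Map<String, SketchSchemaType> singleColumn(String colName, SketchSchemaType type) {
    return Collections.singletonMap(colName, type);
  }
}
```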
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1882,21 +1885,29 @@ public static Option<Schema> tryResolveSchemaForTable(HoodieTableMetaClient data
}
/**
- * Given a schema, coerces provided value to instance of {@link Comparable<?>} such that
- * it could subsequently be used in column stats
+ * Given a HoodieSchema, coerces provided value to instance of {@link Comparable<?>} such that
+ * it could subsequently be used in column stats. This method uses HoodieSchema for in-memory
+ * processing while maintaining compatibility with existing Avro-based serialization.
*
* NOTE: This method has to stay compatible with the semantic of
* {@link FileFormatUtils#readColumnStatsFromMetadata} as they are used in tandem
+ *
+ * @param hoodieSchema the HoodieSchema to use for type coercion
+ * @param val the value to coerce
+ * @return the coerced value as a Comparable
+ * @since 1.2.0
*/
- public static Comparable<?> coerceToComparable(Schema schema, Object val) {
+ public static Comparable<?> coerceToComparable(HoodieSchema hoodieSchema, Object val) {
Review Comment:
nitpick: let's just leave the variable name as `schema`
##########
hudi-common/src/main/java/org/apache/hudi/stats/ValueMetadata.java:
##########
@@ -234,19 +235,32 @@ public static ValueMetadata getValueMetadata(GenericRecord columnStatsRecord) {
}
}
- public static ValueMetadata getValueMetadata(Schema fieldSchema, HoodieIndexVersion indexVersion) {
+ /**
+ * Creates ValueMetadata from HoodieSchema for column statistics type inference.
+ * This method uses HoodieSchema for in-memory processing while maintaining
+ * compatibility with existing Avro-based serialization.
+ *
+ * @param fieldSchema the HoodieSchema of the field
+ * @param indexVersion the index version to determine metadata format
+ * @return ValueMetadata instance for the given schema
+ * @throws IllegalArgumentException if schema is null or has unsupported logical type
+ * @since 1.2.0
+ */
+ public static ValueMetadata getValueMetadata(HoodieSchema fieldSchema, HoodieIndexVersion indexVersion) {
if (indexVersion.lowerThan(HoodieIndexVersion.V2)) {
return V1EmptyMetadata.get();
}
if (fieldSchema == null) {
throw new IllegalArgumentException("Field schema cannot be null");
}
- Schema valueSchema = getNonNullTypeFromUnion(fieldSchema);
+ HoodieSchema valueSchema = HoodieSchemaUtils.getNonNullTypeFromUnion(fieldSchema);
ValueType valueType = ValueType.fromSchema(valueSchema);
if (valueType == ValueType.V1) {
- throw new IllegalArgumentException("Unsupported logical type for: " + valueSchema.getLogicalType());
+ Schema avroSchema = valueSchema.toAvroSchema();
+ throw new IllegalArgumentException("Unsupported logical type for: " + avroSchema.getLogicalType());
} else if (valueType == ValueType.DECIMAL) {
- return DecimalMetadata.create((LogicalTypes.Decimal) valueSchema.getLogicalType());
+ Schema avroSchema = valueSchema.toAvroSchema();
+ return DecimalMetadata.create((LogicalTypes.Decimal) avroSchema.getLogicalType());
Review Comment:
Can this be updated to operate directly on HoodieSchema?
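One possible shape for this, sketched with stand-in classes since the real `HoodieSchema.Decimal` and `DecimalMetadata` APIs are not reproduced here: if the decimal schema subclass exposes precision and scale itself, the metadata can be built without the `toAvroSchema()` round trip.

```java
// Hypothetical stand-in for HoodieSchema.Decimal carrying its own
// precision/scale, so no conversion to an Avro LogicalTypes.Decimal is needed.
class DecimalSchemaSketch {
  final int precision;
  final int scale;
  DecimalSchemaSketch(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }
}

// Hypothetical stand-in for DecimalMetadata with a factory that reads the
// decimal parameters straight off the HoodieSchema-style object.
class DecimalMetadataSketch {
  final int precision;
  final int scale;
  private DecimalMetadataSketch(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }
  static DecimalMetadataSketch create(DecimalSchemaSketch schema) {
    return new DecimalMetadataSketch(schema.precision, schema.scale);
  }
}
```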
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]