yihua commented on code in PR #13711:
URL: https://github.com/apache/hudi/pull/13711#discussion_r2370366329
##########
hudi-utilities/src/test/resources/col-stats/colstats-upgrade-test-v6.zip:
##########

Review Comment:
   https://github.com/apache/hudi/pull/13711#discussion_r2353811554 Reminder to add a README.md describing how these test table artifacts are generated and which aspects they cover. You could pull the script/README out of the zip so it is easy to discover and read.

##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##########
@@ -192,9 +195,12 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
     val fixedPartitionIndexes = fixedPartitionIndexesArr.toSet

     // schema that we want fg reader to output to us
+    val exclusionFields = new java.util.HashSet[String]()
+    exclusionFields.add("op")
+    partitionSchema.fields.foreach(f => exclusionFields.add(f.name))
     val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
-    val requestedAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName)
-    val dataAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName)
+    val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
+    val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)

Review Comment:
   I still think the logic here around the `op` field should not live in this file format class. @jonvex, could you follow up on why it is needed here?
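To make the questioned logic concrete: the diff builds an exclusion set containing `op` (the CDC operation column) plus every partition column name, and that set is then used to prune the Avro schemas. The following is a minimal, Spark-free Java sketch of what that exclusion filtering amounts to; the method and class names are hypothetical, and this is not Hudi's actual `AvroSchemaUtils.pruneDataSchema` implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExclusionSketch {
    // Illustrative only: mirrors how the diff assembles its exclusion set
    // ("op" plus all partition column names) and what filtering a field
    // list against that set looks like.
    static List<String> withoutExcluded(List<String> fieldNames, List<String> partitionCols) {
        Set<String> exclusionFields = new HashSet<>();
        exclusionFields.add("op");              // CDC operation column singled out by the diff
        exclusionFields.addAll(partitionCols);  // partition columns, as in the diff
        return fieldNames.stream()
                .filter(f -> !exclusionFields.contains(f))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> fields = List.of("uuid", "op", "ts", "partition_path");
        System.out.println(withoutExcluded(fields, List.of("partition_path")));
        // prints [uuid, ts]
    }
}
```

The reviewer's concern is not with this mechanism itself but with its placement: hard-coding a record-level column like `op` inside the file-format reader couples the reader to CDC-specific knowledge.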
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkMetadataWriterUtils.java:
##########
@@ -206,6 +212,16 @@ public static ExpressionIndexComputationMetadata getExpressionIndexRecordsUsingC
         : new ExpressionIndexComputationMetadata(colStatRecords);
   }

+  private static SparkValueMetadata getValueMetadataFromColumnRangeDataset(Dataset<Row> dataset, HoodieIndexVersion indexVersion) {

Review Comment:
   Not following. This method only needs the schema of `dataset`, i.e., `dataset.schema()`, correct? If so, why pass in the `dataset` itself, which carries data the method never reads?
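The refactor the reviewer is suggesting, narrowing a parameter from the whole dataset down to just its schema, can be sketched in plain Java. The `Field`/`Schema` records below are hypothetical stand-ins for Spark's `StructField`/`StructType`, and `describeValueMetadata` is an illustrative placeholder, not the actual `getValueMetadataFromColumnRangeDataset` logic.

```java
import java.util.List;

public class SchemaOnlySketch {
    // Hypothetical stand-ins for Spark's StructField/StructType:
    // only column names and types, no rows.
    record Field(String name, String type) {}
    record Schema(List<Field> fields) {}

    // Accepts the schema directly instead of a full dataset, since the
    // data itself is never read. A caller holding a Dataset<Row> would
    // pass dataset.schema() at the call site.
    static String describeValueMetadata(Schema schema) {
        return schema.fields().stream()
                .map(f -> f.name() + ":" + f.type())
                .reduce((a, b) -> a + "," + b)
                .orElse("");
    }

    public static void main(String[] args) {
        Schema s = new Schema(List.of(new Field("min_val", "long"), new Field("max_val", "long")));
        System.out.println(describeValueMetadata(s));
        // prints min_val:long,max_val:long
    }
}
```

Taking the narrowest type that satisfies the method also makes it easier to test (no Spark session or data needed to construct a schema) and documents at the signature level that the rows are irrelevant.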
