yihua commented on code in PR #13711:
URL: https://github.com/apache/hudi/pull/13711#discussion_r2370366329
##########
hudi-utilities/src/test/resources/col-stats/colstats-upgrade-test-v6.zip:
##########

Review Comment:
   https://github.com/apache/hudi/pull/13711#discussion_r2353811554 Reminder to add a README.md describing how these test table artifacts are generated and which aspects they cover. You could pull the script/README out of the zip so it is easy to discover and read.

##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##########
@@ -192,9 +195,12 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
     val fixedPartitionIndexes = fixedPartitionIndexesArr.toSet

     // schema that we want fg reader to output to us
+    val exclusionFields = new java.util.HashSet[String]()
+    exclusionFields.add("op")
+    partitionSchema.fields.foreach(f => exclusionFields.add(f.name))
     val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
-    val requestedAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName)
-    val dataAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName)
+    val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
+    val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)

Review Comment:
   I still think the logic here around the `op` field should not live in this file format class. @jonvex, could you follow up on why it is needed here?
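To make the questioned logic concrete: the diff builds an exclusion set containing `op` (the CDC operation column) plus every partition column name, and that set is then used to prune the Avro schemas. The following is a minimal, Spark-free Java sketch of what that exclusion filtering amounts to; the method and class names are hypothetical, and this is not Hudi's actual `AvroSchemaUtils.pruneDataSchema` implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExclusionSketch {
    // Illustrative only: mirrors how the diff assembles its exclusion set
    // ("op" plus all partition column names) and what filtering a field
    // list against that set looks like.
    static List<String> withoutExcluded(List<String> fieldNames, List<String> partitionCols) {
        Set<String> exclusionFields = new HashSet<>();
        exclusionFields.add("op");              // CDC operation column singled out by the diff
        exclusionFields.addAll(partitionCols);  // partition columns, as in the diff
        return fieldNames.stream()
                .filter(f -> !exclusionFields.contains(f))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> fields = List.of("uuid", "op", "ts", "partition_path");
        System.out.println(withoutExcluded(fields, List.of("partition_path")));
        // prints [uuid, ts]
    }
}
```

The reviewer's concern is not with this mechanism itself but with its placement: hard-coding a record-level column like `op` inside the file-format reader couples the reader to CDC-specific knowledge.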
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkMetadataWriterUtils.java:
##########
@@ -206,6 +212,16 @@ public static ExpressionIndexComputationMetadata getExpressionIndexRecordsUsingC
         : new ExpressionIndexComputationMetadata(colStatRecords);
   }

+  private static SparkValueMetadata getValueMetadataFromColumnRangeDataset(Dataset<Row> dataset, HoodieIndexVersion indexVersion) {

Review Comment:
   Not following. This method only needs the schema of `dataset`, i.e., `dataset.schema()`, correct? If so, why pass in the `dataset` itself, which carries data the method never reads?
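The refactor the reviewer is suggesting, narrowing a parameter from the whole dataset down to just its schema, can be sketched in plain Java. The `Field`/`Schema` records below are hypothetical stand-ins for Spark's `StructField`/`StructType`, and `describeValueMetadata` is an illustrative placeholder, not the actual `getValueMetadataFromColumnRangeDataset` logic.

```java
import java.util.List;

public class SchemaOnlySketch {
    // Hypothetical stand-ins for Spark's StructField/StructType:
    // only column names and types, no rows.
    record Field(String name, String type) {}
    record Schema(List<Field> fields) {}

    // Accepts the schema directly instead of a full dataset, since the
    // data itself is never read. A caller holding a Dataset<Row> would
    // pass dataset.schema() at the call site.
    static String describeValueMetadata(Schema schema) {
        return schema.fields().stream()
                .map(f -> f.name() + ":" + f.type())
                .reduce((a, b) -> a + "," + b)
                .orElse("");
    }

    public static void main(String[] args) {
        Schema s = new Schema(List.of(new Field("min_val", "long"), new Field("max_val", "long")));
        System.out.println(describeValueMetadata(s));
        // prints min_val:long,max_val:long
    }
}
```

Taking the narrowest type that satisfies the method also makes it easier to test (no Spark session or data needed to construct a schema) and documents at the signature level that the rows are irrelevant.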
