Re: [PR] Improve Logical Type Handling on Col Stats [hudi]

via GitHub Tue, 16 Sep 2025 14:50:29 -0700


jonvex commented on code in PR #13711:
URL: https://github.com/apache/hudi/pull/13711#discussion_r2353725863



##########
hudi-common/src/main/java/org/apache/hudi/common/util/FileFormatUtils.java:
##########
@@ -64,10 +66,24 @@ public abstract class FileFormatUtils {
    * @param fileColumnRanges List of column range statistics for each file in 
a partition
    */
   public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> 
getColumnRangeInPartition(String relativePartitionPath,
+                                                                               
                  String columnName,
                                                                                
                  @Nonnull List<HoodieColumnRangeMetadata<T>> fileColumnRanges,
-                                                                               
                  Map<String, Schema> colsToIndexSchemaMap) {
+                                                                               
                  Map<String, Schema> colsToIndexSchemaMap,
+                                                                               
                  HoodieIndexVersion indexVersion) {
 
     ValidationUtils.checkArgument(!fileColumnRanges.isEmpty(), 
"fileColumnRanges should not be empty.");
+    if (indexVersion.greaterThanOrEquals(HoodieIndexVersion.V2)) {
+      ValueMetadata valueMetadata = 
ValueMetadata.getValueMetadata(colsToIndexSchemaMap.get(columnName), 
indexVersion);
+      return fileColumnRanges.stream()
+          .map(e -> {
+            T minValue = (T) 
valueMetadata.standardizeJavaTypeAndPromote(e.getMinValue());
+            T maxValue = (T) 
valueMetadata.standardizeJavaTypeAndPromote(e.getMaxValue());
+            return HoodieColumnRangeMetadata.create(
+                relativePartitionPath, e.getColumnName(), minValue, maxValue, 
e.getNullCount(), e.getValueCount(), e.getTotalSize(),
+                e.getTotalUncompressedSize(), valueMetadata);
+          }).reduce(HoodieColumnRangeMetadata::merge).orElseThrow(() -> new 
HoodieException("MergingColumnRanges failed."));
+    }
+    // we are reducing using merge so IDK why we think there are multiple cols 
that need to go through schema evolution

Review Comment:
   My point was that "colToIndex" should be the same for every range in 
fileColumnRanges. And that the code acknowledges that fact by merging all of 
them together right after this. So I was saying that it's weird that this code 
acts like they could be different and uses a map schemaSeenForColsToIndex when 
there will at most be 1 key. In fact in HoodieColumnRangeMetadata.merge we even 
validate that the col names are the same



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Improve Logical Type Handling on Col Stats [hudi]

Reply via email to