vinothchandar commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1564990491


##########
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##########
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
 
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileRanges) {
+    if (fileRanges.size() == 1) {
+      // Only one parquet file, we can just return that range.

Review Comment:
   Fixed the comments. Added tests for null values in min/max, and also an assertion that column names must be the same across the merged objects.



##########
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##########
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
 
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileRanges) {

Review Comment:
   Renamed `fileRanges` to `fileColumnRanges`. Done.



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##########
@@ -651,6 +652,36 @@ public static Stream<HoodieRecord> createColumnStatsRecords(String partitionName
     });
   }
 
+  public static Stream<HoodieRecord> createPartitionStatsRecords(String partitionPath,
+                                                                 Collection<HoodieColumnRangeMetadata<Comparable>> columnRangeMetadataList,
+                                                                 boolean isDeleted) {
+    return columnRangeMetadataList.stream().map(columnRangeMetadata -> {
+      HoodieKey key = new HoodieKey(getPartitionStatsIndexKey(partitionPath, columnRangeMetadata),
+          MetadataPartitionType.PARTITION_STATS.getPartitionPath());
+
+      HoodieMetadataPayload payload = new HoodieMetadataPayload(key.getRecordKey(),
+          HoodieMetadataColumnStats.newBuilder()
+              .setFileName(null)
+              .setColumnName(columnRangeMetadata.getColumnName())
+              .setMinValue(wrapValueIntoAvro(columnRangeMetadata.getMinValue()))
+              .setMaxValue(wrapValueIntoAvro(columnRangeMetadata.getMaxValue()))
+              .setNullCount(columnRangeMetadata.getNullCount())
+              .setValueCount(columnRangeMetadata.getValueCount())
+              .setTotalSize(columnRangeMetadata.getTotalSize())
+              .setTotalUncompressedSize(columnRangeMetadata.getTotalUncompressedSize())
+              .setIsDeleted(isDeleted)
+              .build());
+
+      return new HoodieAvroRecord<>(key, payload);
+    });
+  }
+
+  public static String getPartitionStatsIndexKey(String partitionPath, HoodieColumnRangeMetadata<Comparable> columnRangeMetadata) {

Review Comment:
   We need unit tests for this and the other key generation methods, plus tests that assert the values in metadata table (MT) storage. We should not be testing only at the query write/read level.



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1901,4 +1910,162 @@ private static Path filePath(String basePath, String partition, String filename)
       return new Path(basePath, partition + Path.SEPARATOR + filename);
     }
   }
+
+  public static HoodieData<HoodieRecord> convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,

Review Comment:
   Can we do this automatically? As a user, if I am already doing `partitionBy`, it is a fair expectation that the index is built automatically and partition pruning works without any extra configs.



##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestDataSkippingWithMORColstats.java:
##########
@@ -318,6 +318,7 @@ private Map<String, String> getOptions() {
     options.put(HoodieMetadataConfig.ENABLE.key(), "true");
     options.put(HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key(), "true");
     options.put(HoodieMetadataConfig.COLUMN_STATS_INDEX_FOR_COLUMNS.key(), "trip_type");
+    options.put(HoodieMetadataConfig.ENABLE_METADATA_INDEX_PARTITION_STATS.key(), "false");

Review Comment:
   why are we disabling this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
