yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1569972641
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -224,6 +224,10 @@ private void enablePartitions() {
     if (dataMetaClient.getFunctionalIndexMetadata().isPresent()) {
       this.enabledPartitionTypes.add(FUNCTIONAL_INDEX);
     }
+    if ((metadataConfig.isPartitionStatsIndexEnabled() && !metadataConfig.getColumnsEnabledForColumnStatsIndex().isEmpty())
+        || dataMetaClient.getTableConfig().isMetadataPartitionAvailable(PARTITION_STATS)) {
+      this.enabledPartitionTypes.add(PARTITION_STATS);
Review Comment:
Can this part be generalized too, i.e., relying on the defined partition
type enums to automatically check the corresponding write config and enable the
index, instead of modifying the method manually?
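
A minimal sketch of the generalization being suggested, assuming each partition type carries its own enablement check. All class, field, and method names here are illustrative stand-ins, not Hudi's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Stand-in for the write config; fields are hypothetical, not Hudi's real flags.
class DemoMetadataConfig {
  boolean columnStatsEnabled = true;
  boolean partitionStatsEnabled = true;
  boolean recordIndexEnabled = false;
}

// Each partition type owns the predicate that decides whether it is enabled,
// mirroring the PR's condition that partition stats also needs column stats.
enum DemoPartitionType {
  COLUMN_STATS(c -> c.columnStatsEnabled),
  PARTITION_STATS(c -> c.partitionStatsEnabled && c.columnStatsEnabled),
  RECORD_INDEX(c -> c.recordIndexEnabled);

  private final Predicate<DemoMetadataConfig> enabledCheck;

  DemoPartitionType(Predicate<DemoMetadataConfig> enabledCheck) {
    this.enabledCheck = enabledCheck;
  }

  boolean isEnabled(DemoMetadataConfig config) {
    return enabledCheck.test(config);
  }
}

public class EnablePartitionsDemo {
  // Generalized enablePartitions(): one loop over the enum values, so adding a
  // new index type means adding an enum constant, not editing this method.
  static List<DemoPartitionType> enabledPartitions(DemoMetadataConfig config) {
    List<DemoPartitionType> enabled = new ArrayList<>();
    for (DemoPartitionType type : DemoPartitionType.values()) {
      if (type.isEnabled(config)) {
        enabled.add(type);
      }
    }
    return enabled;
  }

  public static void main(String[] args) {
    System.out.println(enabledPartitions(new DemoMetadataConfig()));
  }
}
```
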
##########
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##########
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileRanges) {
+    if (fileRanges.size() == 1) {
+      // Only one parquet file, we can just return that range.
+      return fileRanges.get(0);
+    }
+    // There are multiple files. Compute min(file_mins) and max(file_maxs)
+    return fileRanges.stream()
+        .sequential()
+        .reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
Review Comment:
   ```suggestion
     private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
   ```
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1580,69 +1595,4 @@ public boolean isInitialized() {
   }
   protected abstract BaseHoodieWriteClient<?, I, ?, ?> initializeWriteClient();
-
-  /**
-   * A class which represents a directory and the files and directories inside it.
-   * <p>
-   * A {@code PartitionFileInfo} object saves the name of the partition and various properties of each file
-   * required for initializing the metadata table. Saving limited properties reduces the total memory footprint when
-   * a very large number of files are present in the dataset being initialized.
-   */
Review Comment:
So this is no longer needed?
##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java:
##########
@@ -557,7 +558,8 @@ public void testDowngradeSixToFiveShouldDeleteRecordIndexPartition() throws Exce
         PARTITION_NAME_COLUMN_STATS,
         PARTITION_NAME_BLOOM_FILTERS,
         PARTITION_NAME_RECORD_INDEX,
-        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX
+        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX,
Review Comment:
Should this list be generated from the production code, i.e., the list of
supported MDT partitions? Also, do we need to upgrade the table version? I
think the master branch is still on table version 6, the same as the 0.14.0 release.
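
As a hedged sketch of the first point: if the partition names live on an enum in production code, the test's expected list can be derived from it rather than hard-coded. The enum, its constants, and the `path()` accessor below are illustrative stand-ins, not Hudi's actual types:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for a production enum of supported MDT partitions.
enum DemoMetadataPartition {
  FILES("files"),
  COLUMN_STATS("column_stats"),
  BLOOM_FILTERS("bloom_filters"),
  RECORD_INDEX("record_index"),
  PARTITION_STATS("partition_stats");

  private final String path;

  DemoMetadataPartition(String path) {
    this.path = path;
  }

  String path() {
    return path;
  }
}

public class PartitionListDemo {
  // The test's expected list stays in sync with production automatically:
  // adding a new enum constant extends this list with no test edit needed.
  static List<String> allPartitionPaths() {
    return Arrays.stream(DemoMetadataPartition.values())
        .map(DemoMetadataPartition::path)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(allPartitionPaths());
  }
}
```
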
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##########
@@ -104,13 +104,14 @@ public class HoodieMetadataPayload implements
HoodieRecordPayload<HoodieMetadata
private static final Logger LOG =
LoggerFactory.getLogger(HoodieMetadataPayload.class);
/**
* Type of the record. This can be an enum in the schema but Avro1.8
- * has a bug - https://issues.apache.org/jira/browse/AVRO-1810
+ * has a bug - <a
href="https://issues.apache.org/jira/browse/AVRO-1810">...</a>
*/
- protected static final int METADATA_TYPE_PARTITION_LIST = 1;
- protected static final int METADATA_TYPE_FILE_LIST = 2;
- protected static final int METADATA_TYPE_COLUMN_STATS = 3;
- protected static final int METADATA_TYPE_BLOOM_FILTER = 4;
+ private static final int METADATA_TYPE_PARTITION_LIST = 1;
+ private static final int METADATA_TYPE_FILE_LIST = 2;
+ private static final int METADATA_TYPE_COLUMN_STATS = 3;
+ private static final int METADATA_TYPE_BLOOM_FILTER = 4;
private static final int METADATA_TYPE_RECORD_INDEX = 5;
+ private static final int METADATA_TYPE_PARTITION_STATS = 6;
Review Comment:
Maybe we should add an enum, guarantee the ordering, and automatically
assign the type IDs?
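
A minimal sketch of that idea: let declaration order assign the type IDs, so the constants above collapse into one enum and a new type cannot collide with or renumber an existing ID. The enum name and `id()` method are illustrative, not Hudi's actual API:

```java
// Illustrative only: the numeric ID comes from declaration order, so the
// ordering guarantee the constants rely on is enforced by the enum itself.
enum DemoMetadataRecordType {
  PARTITION_LIST,   // id 1
  FILE_LIST,        // id 2
  COLUMN_STATS,     // id 3
  BLOOM_FILTER,     // id 4
  RECORD_INDEX,     // id 5
  PARTITION_STATS;  // id 6

  // IDs start at 1 to match the existing METADATA_TYPE_* constants.
  int id() {
    return ordinal() + 1;
  }
}

public class MetadataRecordTypeDemo {
  public static void main(String[] args) {
    for (DemoMetadataRecordType type : DemoMetadataRecordType.values()) {
      System.out.println(type + " -> " + type.id());
    }
  }
}
```

One caveat worth noting in review: ordinal-based IDs are only safe if constants are always appended, never reordered or removed, since the IDs are persisted in the metadata table.
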
##########
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##########
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileRanges) {
+    if (fileRanges.size() == 1) {
+      // Only one parquet file, we can just return that range.
+      return fileRanges.get(0);
+    }
+    // There are multiple files. Compute min(file_mins) and max(file_maxs)
+    return fileRanges.stream()
+        .sequential()
+        .reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
Review Comment:
Do we already have such a method for merging column ranges for col stats, and
could we reuse that?
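
For context, the merge the PR implements is the standard one for column ranges: the merged range takes the smaller min and the larger max. A self-contained sketch, using a stand-in `Range` class rather than `HoodieColumnRangeMetadata`:

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for HoodieColumnRangeMetadata: just a min/max pair for one column.
class Range<T extends Comparable<T>> {
  final T min;
  final T max;

  Range(T min, T max) {
    this.min = min;
    this.max = max;
  }

  // Merge two ranges: smaller of the mins, larger of the maxes.
  static <T extends Comparable<T>> Range<T> merge(Range<T> a, Range<T> b) {
    T min = a.min.compareTo(b.min) <= 0 ? a.min : b.min;
    T max = a.max.compareTo(b.max) >= 0 ? a.max : b.max;
    return new Range<>(min, max);
  }
}

public class RangeMergeDemo {
  public static void main(String[] args) {
    // Per-file ranges for one column, reduced to a partition-level range,
    // matching the stream().reduce(...) shape in getColumnRangeInPartition.
    List<Range<Integer>> fileRanges = Arrays.asList(
        new Range<>(3, 10), new Range<>(1, 7), new Range<>(5, 12));
    Range<Integer> partitionRange = fileRanges.stream().reduce(Range::merge).get();
    System.out.println(partitionRange.min + ".." + partitionRange.max);  // prints 1..12
  }
}
```
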
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]