yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1569972641
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -224,6 +224,10 @@ private void enablePartitions() {
     if (dataMetaClient.getFunctionalIndexMetadata().isPresent()) {
       this.enabledPartitionTypes.add(FUNCTIONAL_INDEX);
     }
+    if ((metadataConfig.isPartitionStatsIndexEnabled() && !metadataConfig.getColumnsEnabledForColumnStatsIndex().isEmpty())
+        || dataMetaClient.getTableConfig().isMetadataPartitionAvailable(PARTITION_STATS)) {
+      this.enabledPartitionTypes.add(PARTITION_STATS);
Review Comment:
Can this part be generalized too, i.e., relying on the defined partition
type enums to automatically check the corresponding write config and enable the
index, instead of modifying the method manually?
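
A minimal sketch of the generalization being suggested, assuming each partition type carries its own enablement check. All class, field, and method names here are illustrative stand-ins, not Hudi's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Stand-in for the write config; fields are hypothetical, not Hudi's real flags.
class DemoMetadataConfig {
  boolean columnStatsEnabled = true;
  boolean partitionStatsEnabled = true;
  boolean recordIndexEnabled = false;
}

// Each partition type owns the predicate that decides whether it is enabled,
// mirroring the PR's condition that partition stats also needs column stats.
enum DemoPartitionType {
  COLUMN_STATS(c -> c.columnStatsEnabled),
  PARTITION_STATS(c -> c.partitionStatsEnabled && c.columnStatsEnabled),
  RECORD_INDEX(c -> c.recordIndexEnabled);

  private final Predicate<DemoMetadataConfig> enabledCheck;

  DemoPartitionType(Predicate<DemoMetadataConfig> enabledCheck) {
    this.enabledCheck = enabledCheck;
  }

  boolean isEnabled(DemoMetadataConfig config) {
    return enabledCheck.test(config);
  }
}

public class EnablePartitionsDemo {
  // Generalized enablePartitions(): one loop over the enum values, so adding a
  // new index type means adding an enum constant, not editing this method.
  static List<DemoPartitionType> enabledPartitions(DemoMetadataConfig config) {
    List<DemoPartitionType> enabled = new ArrayList<>();
    for (DemoPartitionType type : DemoPartitionType.values()) {
      if (type.isEnabled(config)) {
        enabled.add(type);
      }
    }
    return enabled;
  }

  public static void main(String[] args) {
    System.out.println(enabledPartitions(new DemoMetadataConfig()));
  }
}
```
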
##########
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##########
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileRanges) {
+    if (fileRanges.size() == 1) {
+      // Only one parquet file, we can just return that range.
+      return fileRanges.get(0);
+    }
+    // There are multiple files. Compute min(file_mins) and max(file_maxs)
+    return fileRanges.stream()
+        .sequential()
+        .reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
Review Comment:
   ```suggestion
     private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
   ```
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1580,69 +1595,4 @@ public boolean isInitialized() {
   }
   protected abstract BaseHoodieWriteClient<?, I, ?, ?> initializeWriteClient();
-
-  /**
-   * A class which represents a directory and the files and directories inside it.
-   * <p>
-   * A {@code PartitionFileInfo} object saves the name of the partition and various properties of each file
-   * required for initializing the metadata table. Saving limited properties reduces the total memory footprint when
-   * a very large number of files are present in the dataset being initialized.
-   */
Review Comment:
So this is no longer needed?
##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java:
##########
@@ -557,7 +558,8 @@ public void testDowngradeSixToFiveShouldDeleteRecordIndexPartition() throws Exce
         PARTITION_NAME_COLUMN_STATS,
         PARTITION_NAME_BLOOM_FILTERS,
         PARTITION_NAME_RECORD_INDEX,
-        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX
+        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX,
Review Comment:
Should this list be generated from the production code, i.e., the list of
supported MDT partitions? Also, do we need to upgrade the table version? I
think the master branch is still on table version 6, the same as the 0.14.0 release.
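
As a hedged sketch of the first point: if the partition names live on an enum in production code, the test's expected list can be derived from it rather than hard-coded. The enum, its constants, and the `path()` accessor below are illustrative stand-ins, not Hudi's actual types:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for a production enum of supported MDT partitions.
enum DemoMetadataPartition {
  FILES("files"),
  COLUMN_STATS("column_stats"),
  BLOOM_FILTERS("bloom_filters"),
  RECORD_INDEX("record_index"),
  PARTITION_STATS("partition_stats");

  private final String path;

  DemoMetadataPartition(String path) {
    this.path = path;
  }

  String path() {
    return path;
  }
}

public class PartitionListDemo {
  // The test's expected list stays in sync with production automatically:
  // adding a new enum constant extends this list with no test edit needed.
  static List<String> allPartitionPaths() {
    return Arrays.stream(DemoMetadataPartition.values())
        .map(DemoMetadataPartition::path)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(allPartitionPaths());
  }
}
```
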
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##########
@@ -104,13 +104,14 @@ public class HoodieMetadataPayload implements
HoodieRecordPayload<HoodieMetadata
private static final Logger LOG =
LoggerFactory.getLogger(HoodieMetadataPayload.class);
/**
* Type of the record. This can be an enum in the schema but Avro1.8
- * has a bug - https://issues.apache.org/jira/browse/AVRO-1810
+ * has a bug - <a
href="https://issues.apache.org/jira/browse/AVRO-1810">...</a>
*/
- protected static final int METADATA_TYPE_PARTITION_LIST = 1;
- protected static final int METADATA_TYPE_FILE_LIST = 2;
- protected static final int METADATA_TYPE_COLUMN_STATS = 3;
- protected static final int METADATA_TYPE_BLOOM_FILTER = 4;
+ private static final int METADATA_TYPE_PARTITION_LIST = 1;
+ private static final int METADATA_TYPE_FILE_LIST = 2;
+ private static final int METADATA_TYPE_COLUMN_STATS = 3;
+ private static final int METADATA_TYPE_BLOOM_FILTER = 4;
private static final int METADATA_TYPE_RECORD_INDEX = 5;
+ private static final int METADATA_TYPE_PARTITION_STATS = 6;
Review Comment:
Maybe we should add an enum, guarantee the ordering, and automatically
assign the type IDs?
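
A minimal sketch of that idea: let declaration order assign the type IDs, so the constants above collapse into one enum and a new type cannot collide with or renumber an existing ID. The enum name and `id()` method are illustrative, not Hudi's actual API:

```java
// Illustrative only: the numeric ID comes from declaration order, so the
// ordering guarantee the constants rely on is enforced by the enum itself.
enum DemoMetadataRecordType {
  PARTITION_LIST,   // id 1
  FILE_LIST,        // id 2
  COLUMN_STATS,     // id 3
  BLOOM_FILTER,     // id 4
  RECORD_INDEX,     // id 5
  PARTITION_STATS;  // id 6

  // IDs start at 1 to match the existing METADATA_TYPE_* constants.
  int id() {
    return ordinal() + 1;
  }
}

public class MetadataRecordTypeDemo {
  public static void main(String[] args) {
    for (DemoMetadataRecordType type : DemoMetadataRecordType.values()) {
      System.out.println(type + " -> " + type.id());
    }
  }
}
```

One caveat worth noting in review: ordinal-based IDs are only safe if constants are always appended, never reordered or removed, since the IDs are persisted in the metadata table.
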
##########
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##########
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileRanges) {
+    if (fileRanges.size() == 1) {
+      // Only one parquet file, we can just return that range.
+      return fileRanges.get(0);
+    }
+    // There are multiple files. Compute min(file_mins) and max(file_maxs)
+    return fileRanges.stream()
+        .sequential()
+        .reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
Review Comment:
Do we already have such a method for merging column ranges for col stats, and
could we reuse that?
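
For context, the merge the PR implements is the standard one for column ranges: the merged range takes the smaller min and the larger max. A self-contained sketch, using a stand-in `Range` class rather than `HoodieColumnRangeMetadata`:

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for HoodieColumnRangeMetadata: just a min/max pair for one column.
class Range<T extends Comparable<T>> {
  final T min;
  final T max;

  Range(T min, T max) {
    this.min = min;
    this.max = max;
  }

  // Merge two ranges: smaller of the mins, larger of the maxes.
  static <T extends Comparable<T>> Range<T> merge(Range<T> a, Range<T> b) {
    T min = a.min.compareTo(b.min) <= 0 ? a.min : b.min;
    T max = a.max.compareTo(b.max) >= 0 ? a.max : b.max;
    return new Range<>(min, max);
  }
}

public class RangeMergeDemo {
  public static void main(String[] args) {
    // Per-file ranges for one column, reduced to a partition-level range,
    // matching the stream().reduce(...) shape in getColumnRangeInPartition.
    List<Range<Integer>> fileRanges = Arrays.asList(
        new Range<>(3, 10), new Range<>(1, 7), new Range<>(5, 12));
    Range<Integer> partitionRange = fileRanges.stream().reduce(Range::merge).get();
    System.out.println(partitionRange.min + ".." + partitionRange.max);  // prints 1..12
  }
}
```
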
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]