prashantwason commented on code in PR #18182:
URL: https://github.com/apache/hudi/pull/18182#discussion_r3230492213
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2281,6 +2298,78 @@ public static String
deleteMetadataTablePartition(HoodieTableMetaClient dataMeta
return null;
}
+ /**
+ * Returns true if bucketing is enabled for the metadata table.
+ *
+ * @param metaClient MetaClient for the metadata table
+ * @return true if bucketing is enabled
+ */
+ private static boolean isBucketingEnabledForMDT(HoodieTableMetaClient
metaClient) {
+ // Bucketing status is saved within the main dataset's properties.
+ String basePath =
HoodieTableMetadata.getDataTableBasePathFromMetadataTable(metaClient.getBasePath().toString());
+ return HoodieTableMetaClient.builder()
+ .setBasePath(basePath)
+ .setConf(metaClient.getStorageConf().newInstance())
+ .setLoadActiveTimelineOnLoad(false)
+ .build()
+ .getTableConfig()
+ .isMetadataTablePartitionBucketingEnabled();
+ }
+
+ /**
+ * Returns the list of partitions to list for file slices.
+ * When bucketing is enabled, returns the bucket sub-directories.
+ * When bucketing is disabled, returns the partition itself.
+ *
+ * @param metaClient MetaClient for the metadata table
+ * @param partition The partition path
+ * @param bucketingEnabled Whether bucketing is enabled
+ * @return List of partition paths to list
+ */
+ private static List<String> getPartitionsToList(HoodieTableMetaClient
metaClient, String partition, boolean bucketingEnabled) {
+ List<String> partitionsToList = new ArrayList<>();
+ try {
+ if (bucketingEnabled) {
+ // Find all the buckets in the partition
+ StoragePath partitionPath = new StoragePath(metaClient.getBasePath(),
partition);
Review Comment:
Acknowledged, but deferring within this PR. `listDirectEntries` is
structurally needed to enumerate which buckets exist for a partition, since the
bucket count is per-partition (= ceil(fileGroupCount / bucketSize)) and
`fileGroupCount` is not currently tracked per partition in MDT table config.
Eliminating the listing cleanly requires persisting per-partition bucket counts
at MDT init time, which is an invasive change worth tracking separately. The
cost in the current implementation is one `listDirectEntries` per partition per
call (not per file slice), which is bounded.
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -628,6 +628,16 @@ public long getMaxLogFileSize() {
return getLong(MAX_LOG_FILE_SIZE_BYTES_PROP);
}
+ public static final ConfigProperty<Boolean> FILE_GROUP_BUCKETING_ENABLE =
ConfigProperty
+ .key(METADATA_PREFIX + ".file.group.bucketing.enable")
+ .defaultValue(false)
+ .withDocumentation("This flag indicates whether there should be
intermediate partitions under partitions such as record index and filelisting");
Review Comment:
Added `checkArgument(bucketSize > 0, ...)` in `getFileGroupBucketSize()` in
39755a7e so a misconfiguration fails fast rather than producing an
`ArithmeticException` deep inside MDT init.
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
@@ -1458,11 +1460,10 @@ private void
verifyMetadataRecordKeyExcludeFromPayloadLogFiles(HoodieTable table
* these records should have additional meta fields in the payload. When key
deduplication
* is enabled, these records on the disk should have key in the payload as
empty string.
*
- * @param table
* @param logFiles - Metadata table log files to be verified
* @param enableMetaFields - Enable meta fields for records
*/
- private static void verifyMetadataRawRecords(HoodieTable table,
List<HoodieLogFile> logFiles, boolean enableMetaFields) throws IOException {
+private static void verifyMetadataRawRecords(HoodieTable table,
List<HoodieLogFile> logFiles, boolean enableMetaFields) throws IOException {
HoodieStorage storage = table.getStorage();
Review Comment:
Confirmed real compile error (signature was 3-arg with `HoodieTable table`,
two of three call sites passed only 2 args). Took Greptile's clean-fix path in
39755a7e: changed the signature to take `HoodieStorage storage` instead of
`HoodieTable table`, removed the now-unused `HoodieSparkTable.create(...)` at
the third call site, and updated all three call sites to pass `storage` (which
is already in scope via the metaClient or as a parameter). All three callers
are now consistent.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1039,17 +1039,58 @@ private List<DirectoryInfo>
listAllPartitionsFromMDT(String initializationTime,
}
/**
- * Initialize file groups for a partition. For file listing, we just have
one file group.
- * <p>
- * All FileGroups for a given metadata partition has a fixed prefix as per
the {@link MetadataPartitionType#getFileIdPrefix()}.
- * Each file group is suffixed with 4 digits with increments of 1 starting
with 0000.
- * <p>
- * Let's say we configure 10 file groups for record level index partition,
and prefix as "record-index-bucket-"
- * File groups will be named as :
- * record-index-bucket-0000, .... -> ..., record-index-bucket-0009
+ * Initialize file groups for a MDT partition.
+ *
+ * All FileGroups for a given metadata partition have a fixed prefix as per
the {@link MetadataPartitionType#getFileIdPrefix()}.
+ * Each file group is suffixed with 4 digits with increments of 1 starting
with 0000. To limit the number of files that
+ * are created in a single folder, the file groups can be arranged in
buckets.
+ *
+ * ==== When file group bucketing is disabled
+ * Let's say we configure 2000 file groups for record index partition. File
groups will be created as follows inside the
+ * record_index partition directory:
+ * record_index/
+ * record_index/.hoodie_partition_metadata
+ * record_index/record-index-0000
+ * record_index/record-index-0001
+ * ...
+ * record_index/record-index-1999
+ *
+ * So 1000 file groups are created in the metadata partition directory.
+ *
+ * ==== When file group bucketing is enabled
+ * Let's say we configure 2000 file groups for record index partition. File
groups will be created as follows in 2
+ * buckets (0000 and 0001) in the record_index partition directory.:
+ * record_index/
+ * record_index/000/.hoodie_partition_metadata
+ * record_index/0000/record-index-0000
Review Comment:
Fixed both in 39755a7e: "1000 file groups" -> "2000 file groups" (matches
the example) and `record_index/000/` -> `record_index/0000/` (4 digits,
consistent with the surrounding entries).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]