prashantwason commented on code in PR #18182:
URL: https://github.com/apache/hudi/pull/18182#discussion_r3230493438
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2281,6 +2298,78 @@ public static String
deleteMetadataTablePartition(HoodieTableMetaClient dataMeta
return null;
}
+ /**
+ * Returns true if bucketing is enabled for the metadata table.
+ *
+ * @param metaClient MetaClient for the metadata table
+ * @return true if bucketing is enabled
+ */
+ private static boolean isBucketingEnabledForMDT(HoodieTableMetaClient
metaClient) {
+ // Bucketing status is saved within the main dataset's properties.
+ String basePath =
HoodieTableMetadata.getDataTableBasePathFromMetadataTable(metaClient.getBasePath().toString());
+ return HoodieTableMetaClient.builder()
+ .setBasePath(basePath)
+ .setConf(metaClient.getStorageConf().newInstance())
+ .setLoadActiveTimelineOnLoad(false)
+ .build()
+ .getTableConfig()
+ .isMetadataTablePartitionBucketingEnabled();
+ }
+
+ /**
+ * Returns the list of partitions to list for file slices.
+ * When bucketing is enabled, returns the bucket sub-directories.
+ * When bucketing is disabled, returns the partition itself.
+ *
+ * @param metaClient MetaClient for the metadata table
+ * @param partition The partition path
+ * @param bucketingEnabled Whether bucketing is enabled
+ * @return List of partition paths to list
+ */
+ private static List<String> getPartitionsToList(HoodieTableMetaClient
metaClient, String partition, boolean bucketingEnabled) {
+ List<String> partitionsToList = new ArrayList<>();
+ try {
+ if (bucketingEnabled) {
+ // Find all the buckets in the partition
+ StoragePath partitionPath = new StoragePath(metaClient.getBasePath(),
partition);
+ List<StoragePathInfo> statuses =
metaClient.getStorage().listDirectEntries(partitionPath);
+ for (StoragePathInfo status : statuses) {
+ if (status.isDirectory()) {
+ partitionsToList.add(partition + StoragePath.SEPARATOR +
status.getPath().getName());
+ }
+ }
+ log.info("Listing {} buckets in metadata table partition {}",
partitionsToList.size(), partition);
+ } else {
+ // Only list the partition folder as bucketing is not enabled
+ partitionsToList.add(partition);
+ log.info("Listing non-bucketed metadata table partition {}",
partition);
+ }
+ } catch (IOException e) {
+ throw new HoodieMetadataException("Failed to check for bucketing in
metadata table partition " + partition, e);
+ }
+ return partitionsToList;
Review Comment:
I'm pushing back on this one. By design, bucketing is fixed at MDT
initialization time — see
`HoodieBackedTableMetadataWriter.initializeFileGroups` (line 1083 onwards):
bucketing is enabled if it's already on in the data table's config, OR if the
user requested it via `hoodie.metadata.file.group.bucketing.enable` AND the MDT
is not yet initialized. There is no code path that retrofits bucketing onto an
already-initialized MDT, and the new soft-fail behavior in 39755a7e (per the
earlier review comment) preserves that invariant.
So a single MDT is either fully bucketed (every partition's file groups live
under `partition/<bucket>/`) or fully non-bucketed (every partition's file
groups live directly under `partition/`). A mixed state where legacy file
groups under `partition/` coexist with new file groups under
`partition/<bucket>/` cannot occur. Adding a defensive scan of the partition
root would only mask future bugs that violate this invariant, and would
unconditionally double the storage interactions per call.
I documented this invariant explicitly on `getPartitionsToList` in 39755a7e
so the design intent is clear in the source.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]