prashantwason commented on code in PR #18182:
URL: https://github.com/apache/hudi/pull/18182#discussion_r3230492746


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2281,6 +2298,78 @@ public static String 
deleteMetadataTablePartition(HoodieTableMetaClient dataMeta
     return null;
   }
 
+  /**
+   * Returns true if bucketing is enabled for the metadata table.
+   *
+   * @param metaClient MetaClient for the metadata table
+   * @return true if bucketing is enabled
+   */
+  private static boolean isBucketingEnabledForMDT(HoodieTableMetaClient 
metaClient) {
+    // Bucketing status is saved within the main dataset's properties.
+    String basePath = 
HoodieTableMetadata.getDataTableBasePathFromMetadataTable(metaClient.getBasePath().toString());
+    return HoodieTableMetaClient.builder()
+        .setBasePath(basePath)
+        .setConf(metaClient.getStorageConf().newInstance())
+        .setLoadActiveTimelineOnLoad(false)
+        .build()
+        .getTableConfig()
+        .isMetadataTablePartitionBucketingEnabled();
+  }
+
+  /**
+   * Returns the list of partitions to list for file slices.

Review Comment:
   Same root cause as your earlier comment on the rebuild — fixed in 39755a7e 
by persisting the bucketing flag on the MDT's own `hoodie.properties` and 
reading it directly from the MDT metaClient's table config in 
`isBucketingEnabledForMDT`. No more per-call disk read or HoodieTableMetaClient 
construction.



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java:
##########
@@ -647,7 +650,21 @@ private void runFullValidation(HoodieMetadataConfig 
metadataConfig,
     // Metadata table has a fixed number of partitions
     // Cannot use FSUtils.getAllFoldersWithPartitionMetaFile for this as that 
function filters all directory
     // in the .hoodie folder.
-    List<String> metadataTablePartitions = 
FSUtils.getAllPartitionPaths(engineContext, metadataMetaClient, false);
+    List<String> metadataTablePartitions = 
FSUtils.getAllPartitionPaths(engineContext, 
HoodieTableMetadata.getMetadataTableBasePath(basePath),
+        false, false);
+
+    List<MetadataPartitionType> enabledPartitionTypes = 
metadataWriter.getEnabledPartitionTypes();
+    if (writeConfig.getMetadataConfig().isFileGroupBucketingEnabled()) {
+      // First bucket should be present
+      
assertTrue(metadataTablePartitions.contains(HoodieTableMetadataUtil.getBucketRelativePath(MetadataPartitionType.FILES,
 0)));
+    } else {
+      // partition path is the only partition
+      assertEquals(enabledPartitionTypes.size(), 
metadataTablePartitions.size());
+      
assertTrue(metadataTablePartitions.contains(MetadataPartitionType.FILES.getPartitionPath()));
+    }

Review Comment:
   Fixed in 39755a7e. The bucketing branch now iterates over 
`enabledPartitionTypes` and asserts each partition has its bucket-0 
sub-directory present, with the partition path included in the failure message 
for easier debugging. RECORD_INDEX, COLUMN_STATS, etc. are now actually 
verified.



##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
@@ -3760,7 +3762,7 @@ public void testDeleteWithRecordIndex() throws Exception {
     }

Review Comment:
   Restored the 2-space indent on `validateMetadata(SparkRDDWriteClient)` (line 
3762), `verifyMetadataRawRecords` (line 1466), and a third de-indented line in 
`validateMetadata(...)` around line 3989 in 39755a7e.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1060,37 +1101,63 @@ private void initializeFileGroups(HoodieTableMetaClient 
dataMetaClient, Metadata
     // during initial commit, then the fileGroup would still be recognized (as 
a FileSlice with no baseFiles but a
     // valid logFile). Since these log files being created have no content, it 
is safe to add them here before
     // the bulkInsert.
-    final String msg = String.format("Creating %d file groups for partition %s 
with base fileId %s at instant time %s",
-        fileGroupCount, relativePartitionPath, 
metadataPartition.getFileIdPrefix(), instantTime);
+    final int bucketCount = (int) Math.ceil((float) fileGroupCount / 
metadataWriteConfig.getMetadataConfig().getFileGroupBucketSize());
+    List<Pair<String, String>> fileGroupIdAndPathPairList;
+    final String msg;
+    if (bucketingEnabled) {
+      msg = String.format("Creating bucketed file groups for MDT partition %s: 
#buckets=%d, #fileGroups=%d, fileId=%s, instant_time=%s",
+          relativePartitionPath, bucketCount, fileGroupCount, 
metadataPartition.getFileIdPrefix(), instantTime);
+      fileGroupIdAndPathPairList = IntStream.range(0, fileGroupCount)
+          // Since bucketing is enabled, the filegroups will be located in the 
bucket directory inside metadata partition directory
+          .mapToObj(i -> {
+            final int bucketIndex = i / 
metadataWriteConfig.getMetadataConfig().getFileGroupBucketSize();
+            final String bucketPath = 
HoodieTableMetadataUtil.getBucketRelativePath(relativePartitionPath, 
bucketIndex);
+            return 
Pair.of(HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i, 
relativePartitionPath, dataPartitionName), bucketPath);
+          })
+          .collect(Collectors.toList());
+    } else {
+      msg = String.format("Creating file groups for MDT partition %s: 
#fileGroups=%d, fileId=%s, instant_time=%s",
+          relativePartitionPath, fileGroupCount, 
metadataPartition.getFileIdPrefix(), instantTime);
+      fileGroupIdAndPathPairList = IntStream.range(0, fileGroupCount)
+          // Since bucketing is disabled, the filegroups will be located 
directly under the metadata partition directory
+          .mapToObj(i -> 
Pair.of(HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i, 
relativePartitionPath, dataPartitionName), relativePartitionPath))
+          .collect(Collectors.toList());
+    }
+
+    ValidationUtils.checkArgument(fileGroupIdAndPathPairList.size() == 
fileGroupCount);
     LOG.info(msg);
-    final List<String> fileGroupFileIds = IntStream.range(0, fileGroupCount)
-        .mapToObj(i -> 
HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i, 
relativePartitionPath, dataPartitionName))
-        .collect(Collectors.toList());
-    ValidationUtils.checkArgument(fileGroupFileIds.size() == fileGroupCount);
     engineContext.setJobStatus(this.getClass().getSimpleName(), msg);
-    engineContext.foreach(fileGroupFileIds, fileGroupFileId -> {
+    engineContext.foreach(fileGroupIdAndPathPairList, p -> {
+      final String fileGroupFileId = p.getKey();
+      final String relativePath = p.getValue();
+
       try {
         final Map<HeaderMetadataType, String> blockHeader = 
Collections.singletonMap(HeaderMetadataType.INSTANT_TIME, instantTime);
 
         final HoodieDeleteBlock block = new 
HoodieDeleteBlock(Collections.emptyList(), blockHeader);
 
         try (HoodieLogFormat.Writer writer = HoodieLogFormat.newWriterBuilder()
-            
.onParentPath(FSUtils.constructAbsolutePath(metadataWriteConfig.getBasePath(), 
relativePartitionPath))
+            
.onParentPath(FSUtils.constructAbsolutePath(metadataWriteConfig.getBasePath(), 
relativePath))
             .withFileId(fileGroupFileId)
             .withInstantTime(instantTime)
             .withLogVersion(HoodieLogFile.LOGFILE_BASE_VERSION)
             .withFileSize(0L)
             .withSizeThreshold(metadataWriteConfig.getLogFileMaxSize())
-            .withStorage(dataMetaClient.getStorage())
+            .withStorage(storage)
             .withLogWriteToken(HoodieLogFormat.DEFAULT_WRITE_TOKEN)
             .withTableVersion(metadataWriteConfig.getWriteVersion())
             .withFileExtension(HoodieLogFile.DELTA_EXTENSION).build()) {
           writer.appendBlock(block);
         }
       } catch (InterruptedException e) {
-        throw new HoodieException(String.format("Failed to created fileGroup 
%s for partition %s", fileGroupFileId, relativePartitionPath), e);
+        throw new HoodieException(String.format("Failed to created fileGroup 
%s for partition %s", fileGroupFileId, relativePath), e);
       }
-    }, fileGroupFileIds.size());
+    }, fileGroupIdAndPathPairList.size());
+
+    if (bucketingEnabled && 
!dataMetaClient.getTableConfig().isMetadataTablePartitionBucketingEnabled()) {
+      // Bucketing has been enabled so set it in the table config
+      dataMetaClient = 
HoodieTableMetadataUtil.setMetadataTablePartitionBucketing(dataMetaClient, 
bucketingEnabled);

Review Comment:
   Fixed in 39755a7e: assigns to `this.dataMetaClient` (the writer's instance 
field) instead of the shadowing local parameter. Added a comment explaining the 
shadowing hazard.



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -937,6 +947,14 @@ public boolean isDropMetadataIndex(String indexName) {
     return subIndexNameToDrop.contains(indexName);
   }
 
+  public boolean isFileGroupBucketingEnabled() {
+    return getBoolean(FILE_GROUP_BUCKETING_ENABLE);
+  }
+
+  public int getFileGroupBucketSize() {
+    return getInt(FILE_GROUP_BUCKET_SIZE);
+  }

Review Comment:
   Added `checkArgument(bucketSize > 0, ...)` in `getFileGroupBucketSize()` in 
39755a7e. Misconfigurations now fail at the config-getter rather than as an 
`ArithmeticException` inside `initializeFileGroups`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to