prashantwason commented on code in PR #18182:
URL: https://github.com/apache/hudi/pull/18182#discussion_r3230490940
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1039,17 +1039,58 @@ private List<DirectoryInfo>
listAllPartitionsFromMDT(String initializationTime,
}
Review Comment:
The link is already in `{@link MetadataPartitionType#getFileIdPrefix()}`
form (line 1044) and resolves correctly: `MetadataPartitionType` is in the same
package as this class, so the simple-name reference is valid Javadoc. No change
made.
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
Review Comment:
The `private` modifier was not actually removed: `validateMetadata` is still
`private void`. The PR change was lost indentation (the line was at column 0
instead of the 2-space class-body indent). Restored the indent in 39755a7e.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1039,17 +1039,58 @@ private List<DirectoryInfo>
listAllPartitionsFromMDT(String initializationTime,
}
/**
- * Initialize file groups for a partition. For file listing, we just have
one file group.
- * <p>
- * All FileGroups for a given metadata partition has a fixed prefix as per
the {@link MetadataPartitionType#getFileIdPrefix()}.
- * Each file group is suffixed with 4 digits with increments of 1 starting
with 0000.
- * <p>
- * Let's say we configure 10 file groups for record level index partition,
and prefix as "record-index-bucket-"
- * File groups will be named as :
- * record-index-bucket-0000, .... -> ..., record-index-bucket-0009
+ * Initialize file groups for a MDT partition.
+ *
+ * All FileGroups for a given metadata partition have a fixed prefix as per
the {@link MetadataPartitionType#getFileIdPrefix()}.
+ * Each file group is suffixed with 4 digits with increments of 1 starting
with 0000. To limit the number of files that
+ * are created in a single folder, the file groups can be arranged in
buckets.
+ *
+ * ==== When file group bucketing is disabled
+ * Let's say we configure 2000 file groups for record index partition. File
groups will be created as follows inside the
+ * record_index partition directory:
+ * record_index/
+ * record_index/.hoodie_partition_metadata
+ * record_index/record-index-0000
+ * record_index/record-index-0001
+ * ...
+ * record_index/record-index-1999
+ *
+ * So 1000 file groups are created in the metadata partition directory.
+ *
+ * ==== When file group bucketing is enabled
+ * Let's say we configure 2000 file groups for record index partition. File
groups will be created as follows in 2
+ * buckets (0000 and 0001) in the record_index partition directory.:
+ * record_index/
+ * record_index/000/.hoodie_partition_metadata
+ * record_index/0000/record-index-0000
+ * record_index/0000/record-index-0001
+ * ...
+ * record_index/0000/record-index-0999
+ * record_index/0001/.hoodie_partition_metadata
+ * record_index/0001/record-index-1000
+ * record_index/0001/record-index-1001
+ * ...
+ * record_index/0001/record-index-1999
+ *
+ * So 1000 file groups are created in each bucket within the metadata
partition directory.
*/
private void initializeFileGroups(HoodieTableMetaClient dataMetaClient,
MetadataPartitionType metadataPartition, String instantTime,
int fileGroupCount, String
relativePartitionPath, Option<String> dataPartitionName) throws IOException {
+ // Capture the storage to avoid lambda capture issues with dataMetaClient
reassignment
+ final HoodieStorage storage = dataMetaClient.getStorage();
+
+ // Bucketing is enabled for the entire MDT at the time it is initialized.
The bucketing cannot be changed later.
+ boolean bucketingEnabled = false;
Review Comment:
Agreed. Replaced the throw with a warning + soft-disable of bucketing for
that partition. Existing pipelines no longer hard-fail when the user enables
`hoodie.metadata.file.group.bucketing.enable` on a table whose MDT was already
initialized without bucketing. Fixed in 39755a7e.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1060,37 +1101,63 @@ private void initializeFileGroups(HoodieTableMetaClient
dataMetaClient, Metadata
// during initial commit, then the fileGroup would still be recognized (as
a FileSlice with no baseFiles but a
// valid logFile). Since these log files being created have no content, it
is safe to add them here before
// the bulkInsert.
- final String msg = String.format("Creating %d file groups for partition %s
with base fileId %s at instant time %s",
- fileGroupCount, relativePartitionPath,
metadataPartition.getFileIdPrefix(), instantTime);
+ final int bucketCount = (int) Math.ceil((float) fileGroupCount /
metadataWriteConfig.getMetadataConfig().getFileGroupBucketSize());
+ List<Pair<String, String>> fileGroupIdAndPathPairList;
+ final String msg;
+ if (bucketingEnabled) {
+ msg = String.format("Creating bucketed file groups for MDT partition %s:
#buckets=%d, #fileGroups=%d, fileId=%s, instant_time=%s",
+ relativePartitionPath, bucketCount, fileGroupCount,
metadataPartition.getFileIdPrefix(), instantTime);
+ fileGroupIdAndPathPairList = IntStream.range(0, fileGroupCount)
+ // Since bucketing is enabled, the filegroups will be located in the
bucket directory inside metadata partition directory
+ .mapToObj(i -> {
+ final int bucketIndex = i /
metadataWriteConfig.getMetadataConfig().getFileGroupBucketSize();
+ final String bucketPath =
HoodieTableMetadataUtil.getBucketRelativePath(relativePartitionPath,
bucketIndex);
+ return
Pair.of(HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i,
relativePartitionPath, dataPartitionName), bucketPath);
+ })
+ .collect(Collectors.toList());
+ } else {
+ msg = String.format("Creating file groups for MDT partition %s:
#fileGroups=%d, fileId=%s, instant_time=%s",
+ relativePartitionPath, fileGroupCount,
metadataPartition.getFileIdPrefix(), instantTime);
+ fileGroupIdAndPathPairList = IntStream.range(0, fileGroupCount)
+ // Since bucketing is disabled, the filegroups will be located
directly under the metadata partition directory
+ .mapToObj(i ->
Pair.of(HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i,
relativePartitionPath, dataPartitionName), relativePartitionPath))
+ .collect(Collectors.toList());
+ }
+
+ ValidationUtils.checkArgument(fileGroupIdAndPathPairList.size() ==
fileGroupCount);
LOG.info(msg);
- final List<String> fileGroupFileIds = IntStream.range(0, fileGroupCount)
- .mapToObj(i ->
HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i,
relativePartitionPath, dataPartitionName))
- .collect(Collectors.toList());
- ValidationUtils.checkArgument(fileGroupFileIds.size() == fileGroupCount);
engineContext.setJobStatus(this.getClass().getSimpleName(), msg);
- engineContext.foreach(fileGroupFileIds, fileGroupFileId -> {
+ engineContext.foreach(fileGroupIdAndPathPairList, p -> {
+ final String fileGroupFileId = p.getKey();
+ final String relativePath = p.getValue();
+
try {
final Map<HeaderMetadataType, String> blockHeader =
Collections.singletonMap(HeaderMetadataType.INSTANT_TIME, instantTime);
final HoodieDeleteBlock block = new
HoodieDeleteBlock(Collections.emptyList(), blockHeader);
try (HoodieLogFormat.Writer writer = HoodieLogFormat.newWriterBuilder()
-
.onParentPath(FSUtils.constructAbsolutePath(metadataWriteConfig.getBasePath(),
relativePartitionPath))
+
.onParentPath(FSUtils.constructAbsolutePath(metadataWriteConfig.getBasePath(),
relativePath))
.withFileId(fileGroupFileId)
.withInstantTime(instantTime)
.withLogVersion(HoodieLogFile.LOGFILE_BASE_VERSION)
.withFileSize(0L)
.withSizeThreshold(metadataWriteConfig.getLogFileMaxSize())
- .withStorage(dataMetaClient.getStorage())
+ .withStorage(storage)
.withLogWriteToken(HoodieLogFormat.DEFAULT_WRITE_TOKEN)
.withTableVersion(metadataWriteConfig.getWriteVersion())
.withFileExtension(HoodieLogFile.DELTA_EXTENSION).build()) {
writer.appendBlock(block);
}
} catch (InterruptedException e) {
- throw new HoodieException(String.format("Failed to created fileGroup
%s for partition %s", fileGroupFileId, relativePartitionPath), e);
+ throw new HoodieException(String.format("Failed to created fileGroup
%s for partition %s", fileGroupFileId, relativePath), e);
}
- }, fileGroupFileIds.size());
+ }, fileGroupIdAndPathPairList.size());
+
+ if (bucketingEnabled &&
!dataMetaClient.getTableConfig().isMetadataTablePartitionBucketingEnabled()) {
+ // Bucketing has been enabled so set it in the table config
+ dataMetaClient =
HoodieTableMetadataUtil.setMetadataTablePartitionBucketing(dataMetaClient,
bucketingEnabled);
+ }
Review Comment:
Fixed in 39755a7e: now assigns to `this.dataMetaClient` so subsequent
partition initializations in the same writer see the refreshed table config.
Added a comment at the call site to make the shadowing hazard explicit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]