nbalajee commented on a change in pull request #3426:
URL: https://github.com/apache/hudi/pull/3426#discussion_r686405376
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -381,6 +387,57 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
     return partitionToFileStatus;
   }
+  /**
+   * Initialize shards for a partition.
+   *
+   * Each shard is a single log file with the following format:
+   *   <fileIdPrefix>ABCD
+   * where ABCD are digits. This allows up to 9999 shards.
+   *
+   * Example:
+   *   fc9f18eb-6049-4f47-bc51-23884bef0001
+   *   fc9f18eb-6049-4f47-bc51-23884bef0002
+   */
+  private void initializeShards(HoodieTableMetaClient datasetMetaClient, String partition, String instantTime,
+      int shardCount) throws IOException {
+    ValidationUtils.checkArgument(shardCount <= 9999, "Maximum 9999 shards are supported.");
+
+    final String newFileId = FSUtils.createNewFileIdPfx();
+    final String newFileIdPrefix = newFileId.substring(0, 32);
+    final HashMap<HeaderMetadataType, String> blockHeader = new HashMap<>();
+    blockHeader.put(HeaderMetadataType.INSTANT_TIME, instantTime);
+    final HoodieDeleteBlock block = new HoodieDeleteBlock(new HoodieKey[0], blockHeader);
+
+    LOG.info(String.format("Creating %d shards for partition %s with base fileId %s at instant time %s",
+        shardCount, partition, newFileId, instantTime));
+    for (int i = 0; i < shardCount; ++i) {
+      // Generate an indexed fileId for each shard and write a log block into it to create the file.
+      final String shardFileId = String.format("%s%04d", newFileIdPrefix, i + 1);
+      ValidationUtils.checkArgument(newFileId.length() == shardFileId.length(), "FileId should be of length " + newFileId.length());
+      try {
+        HoodieLogFormat.Writer writer = HoodieLogFormat.newWriterBuilder()
+            .onParentPath(FSUtils.getPartitionPath(metadataWriteConfig.getBasePath(), partition))
+            .withFileId(shardFileId).overBaseCommit(instantTime)
+            .withLogVersion(HoodieLogFile.LOGFILE_BASE_VERSION)
+            .withFileSize(0L)
+            .withSizeThreshold(metadataWriteConfig.getLogFileMaxSize())
+            .withFs(datasetMetaClient.getFs())
+            .withRolloverLogWriteToken(FSUtils.makeWriteToken(0, 0, 0))
+            .withLogWriteToken(FSUtils.makeWriteToken(0, 0, 0))
+            .withFileExtension(HoodieLogFile.DELTA_EXTENSION).build();
Review comment:
Are there any advantages to creating a log file vs the base file here?
##########
File path:
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -340,4 +345,46 @@ private static void processRollbackMetadata(HoodieRollbackMetadata rollbackMetad
     return records;
   }
+
+  /**
+   * Map a key to a shard.
+   *
+   * Note: For hashing, the algorithm is the same as String.hashCode(), but it is redefined here because the
+   * hashCode() implementation is not guaranteed by the JVM to be consistent across JVM versions and
+   * implementations.
+   *
+   * @param str the key to map to a shard
+   * @return An integer hash of the given string
+   */
+  public static int keyToShard(String str, int numShards) {
+    int h = 0;
Review comment:
With a 6+ character key, int would overflow. Should this be a long?
Also, should keyToShard be customizable with other hash functions?
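
To make the overflow question concrete, here is a minimal sketch of the hashing scheme the Javadoc describes: the `String.hashCode()` recurrence (`h = 31 * h + c`) reimplemented for cross-JVM stability. The class name and the final `floorMod` mapping into `[0, numShards)` are illustrative assumptions, not the PR's actual code:

```java
public class KeyToShardExample {

  // Sketch of the scheme the Javadoc describes: the String.hashCode()
  // recurrence reimplemented so the result stays stable across JVM versions.
  public static int keyToShard(String str, int numShards) {
    int h = 0;
    for (int i = 0; i < str.length(); i++) {
      // int arithmetic in Java wraps on overflow (well-defined two's
      // complement), so long keys are not an error, but h may go negative.
      h = 31 * h + str.charAt(i);
    }
    // floorMod (unlike %) always yields a non-negative shard index,
    // which matters once the wrapped hash is negative.
    return Math.floorMod(h, numShards);
  }

  public static void main(String[] args) {
    System.out.println(keyToShard("2021/08/10/some-key", 1024));
  }
}
```

Note that in Java, `int` overflow is defined wraparound rather than an error (this is exactly how `String.hashCode()` behaves for long strings); the practical concern is mapping a possibly negative hash to a valid shard index, which `Math.floorMod` handles.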
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -381,6 +387,57 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
     return partitionToFileStatus;
   }
+  /**
+   * Initialize shards for a partition.
+   *
+   * Each shard is a single log file with the following format:
+   *   <fileIdPrefix>ABCD
+   * where ABCD are digits. This allows up to 9999 shards.
+   *
+   * Example:
+   *   fc9f18eb-6049-4f47-bc51-23884bef0001
+   *   fc9f18eb-6049-4f47-bc51-23884bef0002
Review comment:
When an executor receives more than one file's worth of records, we end
up creating f1-0_wt_cx.parquet, f1-1_wt2_cx.parquet, etc. Should the same
naming convention be used here?
In other words, should you use the 36-byte fileId prefix + "-ABCD"?
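
For reference, the fixed-width naming convention from the PR's Javadoc can be sketched as below; the UUID-style base fileId is a made-up example value, and the class name is hypothetical:

```java
public class ShardFileIdExample {
  public static void main(String[] args) {
    // Made-up UUID-style fileId (36 chars). The PR truncates it to the first
    // 32 characters and appends a 4-digit, 1-based shard index, keeping the
    // overall length at 36. The alternative raised in the review would keep
    // the full 36-char prefix and append "-ABCD" instead.
    String newFileId = "fc9f18eb-6049-4f47-bc51-23884bef0000";
    String prefix = newFileId.substring(0, 32);
    for (int i = 0; i < 2; i++) {
      String shardFileId = String.format("%s%04d", prefix, i + 1);
      System.out.println(shardFileId);
      // -> fc9f18eb-6049-4f47-bc51-23884bef0001
      // -> fc9f18eb-6049-4f47-bc51-23884bef0002
    }
  }
}
```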
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]