nbalajee commented on a change in pull request #3426:
URL: https://github.com/apache/hudi/pull/3426#discussion_r686405376



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -381,6 +387,57 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
     return partitionToFileStatus;
   }
 
+  /**
+   * Initialize shards for a partition.
+   *
+   * Each shard is a single log file with the following format:
+   *    <fileIdPrefix>ABCD
+   * where ABCD are digits. This allows up to 9999 shards.
+   *
+   * Example:
+   *    fc9f18eb-6049-4f47-bc51-23884bef0001
+   *    fc9f18eb-6049-4f47-bc51-23884bef0002
+   */
+  private void initializeShards(HoodieTableMetaClient datasetMetaClient, String partition, String instantTime,
+      int shardCount) throws IOException {
+    ValidationUtils.checkArgument(shardCount <= 9999, "Maximum 9999 shards are supported.");
+
+    final String newFileId = FSUtils.createNewFileIdPfx();
+    final String newFileIdPrefix = newFileId.substring(0, 32);
+    final HashMap<HeaderMetadataType, String> blockHeader = new HashMap<>();
+    blockHeader.put(HeaderMetadataType.INSTANT_TIME, instantTime);
+    final HoodieDeleteBlock block = new HoodieDeleteBlock(new HoodieKey[0], blockHeader);
+
+    LOG.info(String.format("Creating %d shards for partition %s with base fileId %s at instant time %s",
+        shardCount, partition, newFileId, instantTime));
+    for (int i = 0; i < shardCount; ++i) {
+      // Generate an indexed fileId for each shard and write a log block into it to create the file.
+      final String shardFileId = String.format("%s%04d", newFileIdPrefix, i + 1);
+      ValidationUtils.checkArgument(newFileId.length() == shardFileId.length(), "FileId should be of length " + newFileId.length());
+      try {
+        HoodieLogFormat.Writer writer = HoodieLogFormat.newWriterBuilder()
+            .onParentPath(FSUtils.getPartitionPath(metadataWriteConfig.getBasePath(), partition))
+            .withFileId(shardFileId).overBaseCommit(instantTime)
+            .withLogVersion(HoodieLogFile.LOGFILE_BASE_VERSION)
+            .withFileSize(0L)
+            .withSizeThreshold(metadataWriteConfig.getLogFileMaxSize())
+            .withFs(datasetMetaClient.getFs())
+            .withRolloverLogWriteToken(FSUtils.makeWriteToken(0, 0, 0))
+            .withLogWriteToken(FSUtils.makeWriteToken(0, 0, 0))
+            .withFileExtension(HoodieLogFile.DELTA_EXTENSION).build();

Review comment:
       Are there any advantages to creating a log file vs the base file here?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -340,4 +345,46 @@ private static void processRollbackMetadata(HoodieRollbackMetadata rollbackMetad
 
     return records;
   }
+
+  /**
+   * Map a key to a shard.
+   *
+   * Note: For hashing, the algorithm is the same as String.hashCode(), but it is defined here because the
+   * hashCode() implementation is not guaranteed by the JVM to be consistent across JVM versions and implementations.
+   *
+   * @param str
+   * @return An integer hash of the given string
+   */
+  public static int keyToShard(String str, int numShards) {
+    int h = 0;

Review comment:
       With a 6+ character key, int would overflow. Should this be a long?
   
   Also, should keyToShard be customizable with other hash functions?
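To make the overflow concern concrete: a minimal sketch of a String.hashCode()-style keyToShard (class and method names here are illustrative, not from the PR). Wrapping on overflow is benign in Java int arithmetic since the result is still deterministic; the real pitfall is the sign of the wrapped value, which Math.floorMod handles without widening to long:

```java
public class ShardHasher {

  // Same recurrence as String.hashCode(), pinned here so the result is
  // stable regardless of JVM version or implementation.
  public static int hash(String str) {
    int h = 0;
    for (int i = 0; i < str.length(); i++) {
      h = 31 * h + str.charAt(i); // int arithmetic wraps on overflow, by design
    }
    return h;
  }

  // Math.floorMod keeps the shard index in [0, numShards) even when hash()
  // has wrapped to a negative value; Math.abs(hash) % numShards would be
  // unsafe because Math.abs(Integer.MIN_VALUE) is still negative.
  public static int keyToShard(String key, int numShards) {
    return Math.floorMod(hash(key), numShards);
  }
}
```

Making the hash function pluggable (the reviewer's second question) would then only require swapping the `hash` call behind an interface; the floorMod range reduction stays the same.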

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -381,6 +387,57 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
     return partitionToFileStatus;
   }
 
+  /**
+   * Initialize shards for a partition.
+   *
+   * Each shard is a single log file with the following format:
+   *    <fileIdPrefix>ABCD
+   * where ABCD are digits. This allows up to 9999 shards.
+   *
+   * Example:
+   *    fc9f18eb-6049-4f47-bc51-23884bef0001
+   *    fc9f18eb-6049-4f47-bc51-23884bef0002

Review comment:
       When an executor receives more than one file's worth of records, we end up creating f1-0_wt_cx.parquet, f1-1_wt2_cx.parquet etc. Should the same naming convention be used here?
   
   In other words, should you use the 36 byte fileId prefix + "-ABCD"?
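To contrast the two candidate conventions side by side, a small sketch (helper names are illustrative, not from the codebase) of the PR's 32-char-prefix + 4-digit-suffix scheme versus the suggested full fileId + "-" + index form:

```java
public class ShardFileIdSketch {

  // (a) Convention in the PR: 32-char fileId prefix followed directly by a
  // 4-digit, 1-based shard index, e.g. fc9f18eb-6049-4f47-bc51-23884bef0001.
  public static String prefixStyle(String fileIdPrefix32, int shardIndex) {
    return String.format("%s%04d", fileIdPrefix32, shardIndex);
  }

  // (b) Convention the review suggests: keep the full 36-char fileId and
  // append "-" plus the index, mirroring the f1-0, f1-1 naming produced
  // when one executor writes multiple files.
  public static String suffixStyle(String fileId36, int shardIndex) {
    return String.format("%s-%04d", fileId36, shardIndex);
  }
}
```

Both forms yield fixed-length, lexically sortable fileIds for a given shard count; (b) additionally preserves the full UUID, so existing fileId-parsing code would not need to special-case the metadata table.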




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

