Re: [PR] feat(metadata): optimize RLI bootstrap by sizing file groups from base file footer row counts [hudi]

via GitHub Sun, 14 Jun 2026 23:45:54 -0700


wombatu-kun commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3411489163



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -857,6 +883,32 @@ private int estimateFileGroupCount(HoodieData<HoodieRecord> records) {
     );
   }
 
+  /**
+   * Sums row counts read from each base file's footer metadata, in parallel via the engine context.
+   * Used in place of materializing and counting an RDD of records during RLI bootstrap.
+   */
+  private long estimateRecordCountFromBaseFiles(List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs) {
+    if (partitionBaseFilePairs.isEmpty()) {
+      return 0L;
+    }
+    int parallelism = Math.min(partitionBaseFilePairs.size(),
+        dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+    StorageConfiguration<?> storageConfBroadcast = storageConf;

Review Comment:
   storageConfBroadcast is the existing name for this exact lambda-capture 
pattern in countRecordsInHFiles (HoodieBackedTableMetadataWriter.java:908), 
which this method mirrors. Renaming only the new copy would make the two 
diverge. If the Broadcast-implies-Spark concern is worth acting on, rename both 
occurrences together; otherwise, keeping the existing name is the consistent 
choice.
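
   For readers unfamiliar with the lambda-capture pattern referred to above, 
here is a minimal, self-contained sketch of why a field is commonly copied into 
a local variable before being used in a lambda. It assumes the motivation is 
serialization of the closure, which this thread does not confirm. The names 
LambdaCaptureSketch, Writer, SerializableSupplier, and storageConfLocal are 
hypothetical and are not Hudi classes or identifiers.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class LambdaCaptureSketch {
  // Hypothetical stand-in for a serializable function interface.
  interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

  // Hypothetical stand-in for a writer class that is itself NOT serializable.
  static class Writer {
    private final String storageConf;

    Writer(String storageConf) {
      this.storageConf = storageConf;
    }

    // Reading the field inside the lambda means reading this.storageConf,
    // so the lambda captures the whole enclosing Writer instance.
    SerializableSupplier<String> capturesThis() {
      return () -> storageConf;
    }

    // Copying the field into a local first means the lambda captures
    // only that local value, not the enclosing instance.
    SerializableSupplier<String> capturesLocal() {
      String storageConfLocal = storageConf;
      return () -> storageConfLocal;
    }
  }

  // Returns true if Java serialization of the object succeeds.
  static boolean isSerializable(Object o) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(o);
      return true;
    } catch (NotSerializableException e) {
      return false;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    Writer writer = new Writer("conf");
    System.out.println("captures this:  " + isSerializable(writer.capturesThis()));
    System.out.println("captures local: " + isSerializable(writer.capturesLocal()));
  }
}
```

   The local copy works the same way whatever it is called, which is why the 
naming question here is only about consistency between the two methods.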



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
