codope commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3292757484


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -857,6 +883,32 @@ private int estimateFileGroupCount(HoodieData<HoodieRecord> records) {
     );
   }
 
+  /**
+   * Sums row counts read from each base file's footer metadata, in parallel via the engine context.
+   * Used in place of materializing and counting an RDD of records during RLI bootstrap.
+   */
+  private long estimateRecordCountFromBaseFiles(List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs) {
+    if (partitionBaseFilePairs.isEmpty()) {
+      return 0L;
+    }
+    int parallelism = Math.min(partitionBaseFilePairs.size(),
+        dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+    StorageConfiguration<?> storageConfBroadcast = storageConf;
+    return engineContext.parallelize(partitionBaseFilePairs, parallelism)
+        .map(partitionAndBaseFile -> {
+          HoodieBaseFile baseFile = partitionAndBaseFile.getValue();
+          StoragePath path = baseFile.getStoragePath();
+          try {
+            HoodieStorage storage = HoodieStorageUtils.getStorage(path, storageConfBroadcast);
+            return HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(path).getRowCount(storage, path);
+          } catch (Exception e) {
+            LOG.warn("Failed to read row count from base file footer: {}", path, e);
+            return 0L;

Review Comment:
   Hmm, the bot raises a good point. Shall we match the behaviour of `countRecordsInHFiles` here?
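
For context on the `catch` block this comment is attached to, here is a minimal, self-contained sketch of the two failure-handling choices. It is an illustration only, not Hudi code: `RowCountSketch`, `FooterReader`, `sumLenient`, and `sumStrict` are hypothetical names, and the sketch makes no claim about what `countRecordsInHFiles` actually does.

```java
import java.util.List;

public class RowCountSketch {

  /** Hypothetical stand-in for a footer read that may fail. */
  interface FooterReader {
    long rowCount(String path) throws Exception;
  }

  /** Lenient: a failed read contributes 0, so the total may silently undercount. */
  static long sumLenient(List<String> paths, FooterReader reader) {
    long total = 0L;
    for (String path : paths) {
      try {
        total += reader.rowCount(path);
      } catch (Exception e) {
        // Swallowed on purpose: mirrors the "log a warning and return 0L" shape in the diff.
      }
    }
    return total;
  }

  /** Strict: the first failed read aborts the estimate instead of skewing it. */
  static long sumStrict(List<String> paths, FooterReader reader) {
    long total = 0L;
    for (String path : paths) {
      try {
        total += reader.rowCount(path);
      } catch (Exception e) {
        throw new IllegalStateException("Failed to read row count from " + path, e);
      }
    }
    return total;
  }
}
```

With the lenient variant, one unreadable file lowers the estimate without any visible failure. With the strict variant, the caller sees the error and can decide how to proceed.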



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.