wombatu-kun commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3411489163
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -857,6 +883,32 @@ private int estimateFileGroupCount(HoodieData<HoodieRecord> records) {
);
}
+ /**
+ * Sums row counts read from each base file's footer metadata, in parallel via the engine context.
+ * Used in place of materializing and counting an RDD of records during RLI bootstrap.
+ */
+ private long estimateRecordCountFromBaseFiles(List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs) {
+ if (partitionBaseFilePairs.isEmpty()) {
+ return 0L;
+ }
+ int parallelism = Math.min(partitionBaseFilePairs.size(),
+ dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+ StorageConfiguration<?> storageConfBroadcast = storageConf;
Review Comment:
storageConfBroadcast is the existing name for this exact lambda-capture
pattern in countRecordsInHFiles (HoodieBackedTableMetadataWriter.java:908),
which this method mirrors. Renaming only the new copy would make the two
diverge. If the concern that "Broadcast" implies Spark is worth acting on,
rename both occurrences together; otherwise, keeping the existing name is the
consistent choice.
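
For context on why the local copy exists at all, here is a minimal sketch of the
capture pattern, assuming its purpose is to keep the lambda from capturing the
enclosing writer (my reading of the code, not something the PR states). The
names LambdaCaptureSketch, Conf, Writer, SerializableSupplier, and canSerialize
are hypothetical stand-ins, not Hudi classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class LambdaCaptureSketch {
    // Hypothetical stand-in for a serializable config object.
    static class Conf implements Serializable {
        final String name;
        Conf(String name) { this.name = name; }
    }

    // A lambda type that can be serialized, as closure-shipping engines require.
    interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    // Hypothetical stand-in for the enclosing writer: deliberately NOT Serializable.
    static class Writer {
        private final Conf conf = new Conf("storage");

        // Reads the field directly, so the lambda captures `this` (the Writer).
        SerializableSupplier<String> captureField() {
            return () -> conf.name;
        }

        // Copies the field to a local first, so the lambda captures only the Conf.
        SerializableSupplier<String> captureLocal() {
            Conf confLocal = conf;
            return () -> confLocal.name;
        }
    }

    // Returns true if Java serialization accepts the object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is an IOException subclass.
            return false;
        }
    }

    public static void main(String[] args) {
        Writer w = new Writer();
        System.out.println("field capture serializable: " + canSerialize(w.captureField()));
        System.out.println("local capture serializable: " + canSerialize(w.captureLocal()));
    }
}
```

Whatever the local is called, its job in this sketch is the same in both
methods, which is why the naming question applies equally to both occurrences.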