codope commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3485323310


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -826,28 +842,58 @@ private Pair<Integer, HoodieData<HoodieRecord>> 
initializeRecordIndexPartition(
         dataMetaClient,
         dataWriteConfig);
 
-    // Initialize the file groups
-    final int fileGroupCount = estimateFileGroupCount(records);
+    Pair<Integer, Integer> bounds = getRLIFileGroupCountBounds();
+    int minFileGroupCount = bounds.getLeft();
+    int maxFileGroupCount = bounds.getRight();
+
+    int fileGroupCount;
+    if (minFileGroupCount != maxFileGroupCount) {
+      // Estimate file group count based on record count read from base file 
footer metadata.
+      // Avoids the expensive records.persist() + records.count() pass over 
materialized records.
+      // Filtering to base-file-only slices is safe for RLI: log files in a 
slice can only carry
+      // updates or deletes for existing records, never new inserts (RLI is 
keyed by record key
+      // and updates rewrite the entry rather than add one), so they do not 
contribute to the
+      // distinct-record-count estimate that sizes the file groups.
+      List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs = 
latestMergedPartitionFileSliceList.stream()

Review Comment:
   let's ensure we track this folllowup because Flink MOR streaming writes 
inserts directly to log files right. I am also thinking to support non-blocking 
clustering, we can perhaps write inserts to log files (however that's TBD). 
Flink use case is more relevant. Also, uncompacted file slice with only log 
file silently contribute 0. So, RLI is inconsistent for some corner cases. We 
will have to fix it soon.
   cc: @danny0405 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -826,28 +842,58 @@ private Pair<Integer, HoodieData<HoodieRecord>> 
initializeRecordIndexPartition(
         dataMetaClient,
         dataWriteConfig);
 
-    // Initialize the file groups
-    final int fileGroupCount = estimateFileGroupCount(records);
+    Pair<Integer, Integer> bounds = getRLIFileGroupCountBounds();
+    int minFileGroupCount = bounds.getLeft();
+    int maxFileGroupCount = bounds.getRight();
+
+    int fileGroupCount;
+    if (minFileGroupCount != maxFileGroupCount) {
+      // Estimate file group count based on record count read from base file 
footer metadata.
+      // Avoids the expensive records.persist() + records.count() pass over 
materialized records.
+      // Filtering to base-file-only slices is safe for RLI: log files in a 
slice can only carry
+      // updates or deletes for existing records, never new inserts (RLI is 
keyed by record key
+      // and updates rewrite the entry rather than add one), so they do not 
contribute to the
+      // distinct-record-count estimate that sizes the file groups.
+      List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs = 
latestMergedPartitionFileSliceList.stream()

Review Comment:
   As a side note, are we going to deprecate avro log files in favor 
columnar/parquet completely? That might simplify the above problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to