codope commented on code in PR #18826:
URL: https://github.com/apache/hudi/pull/18826#discussion_r3485323310
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -826,28 +842,58 @@ private Pair<Integer, HoodieData<HoodieRecord>>
initializeRecordIndexPartition(
dataMetaClient,
dataWriteConfig);
- // Initialize the file groups
- final int fileGroupCount = estimateFileGroupCount(records);
+ Pair<Integer, Integer> bounds = getRLIFileGroupCountBounds();
+ int minFileGroupCount = bounds.getLeft();
+ int maxFileGroupCount = bounds.getRight();
+
+ int fileGroupCount;
+ if (minFileGroupCount != maxFileGroupCount) {
+ // Estimate file group count based on record count read from base file
footer metadata.
+ // Avoids the expensive records.persist() + records.count() pass over
materialized records.
+ // Filtering to base-file-only slices is safe for RLI: log files in a
slice can only carry
+ // updates or deletes for existing records, never new inserts (RLI is
keyed by record key
+ // and updates rewrite the entry rather than add one), so they do not
contribute to the
+ // distinct-record-count estimate that sizes the file groups.
+ List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs =
latestMergedPartitionFileSliceList.stream()
Review Comment:
let's ensure we track this folllowup because Flink MOR streaming writes
inserts directly to log files right. I am also thinking to support non-blocking
clustering, we can perhaps write inserts to log files (however that's TBD).
Flink use case is more relevant. Also, uncompacted file slice with only log
file silently contribute 0. So, RLI is inconsistent for some corner cases. We
will have to fix it soon.
cc: @danny0405
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -826,28 +842,58 @@ private Pair<Integer, HoodieData<HoodieRecord>>
initializeRecordIndexPartition(
dataMetaClient,
dataWriteConfig);
- // Initialize the file groups
- final int fileGroupCount = estimateFileGroupCount(records);
+ Pair<Integer, Integer> bounds = getRLIFileGroupCountBounds();
+ int minFileGroupCount = bounds.getLeft();
+ int maxFileGroupCount = bounds.getRight();
+
+ int fileGroupCount;
+ if (minFileGroupCount != maxFileGroupCount) {
+ // Estimate file group count based on record count read from base file
footer metadata.
+ // Avoids the expensive records.persist() + records.count() pass over
materialized records.
+ // Filtering to base-file-only slices is safe for RLI: log files in a
slice can only carry
+ // updates or deletes for existing records, never new inserts (RLI is
keyed by record key
+ // and updates rewrite the entry rather than add one), so they do not
contribute to the
+ // distinct-record-count estimate that sizes the file groups.
+ List<Pair<String, HoodieBaseFile>> partitionBaseFilePairs =
latestMergedPartitionFileSliceList.stream()
Review Comment:
As a side note, are we going to deprecate avro log files in favor
columnar/parquet completely? That might simplify the above problem.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]