danny0405 commented on code in PR #18016:
URL: https://github.com/apache/hudi/pull/18016#discussion_r2748946724
##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java:
##########
@@ -490,6 +491,18 @@ public HashSet<String> getWritePartitionPaths() {
return new HashSet<>(partitionToWriteStats.keySet());
}
+ public Set<String> getWritePartitionPathsWithExistingFileGroupsModified() {
+ return getPartitionToWriteStats()
+ .entrySet()
+ .stream()
+ .filter(partitionAndWriteStats -> partitionAndWriteStats
+ .getValue()
+ .stream()
+ .anyMatch(writeStat -> !Option.ofNullable(writeStat.getPrevCommit()).orElse("null").equalsIgnoreCase("null")))
Review Comment:
Not sure what checking `prevCommit` for null is meant to do here. Do you want
to detect the scenario where a new file group is initiated? I checked the
existing write handles, and there are two special cases:
1.
https://github.com/apache/hudi/blob/04326111808a5bc80f8cb2d5da2f75fa3dcf2091/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/FileGroupReaderBasedMergeHandle.java#L206,
this handle is used for compaction. When all the files in the target file
slice are log files, `prevCommit` is also set to null, but we still need to
check whether to clean this file group because a new file slice is generated.
The same is true for the legacy compaction path, where the create handle is used.
2.
https://github.com/apache/hudi/blob/04326111808a5bc80f8cb2d5da2f75fa3dcf2091/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java#L205,
for MOR log updates to an existing file group, `prevCommit` is always
non-null, yet we basically do not need cleaning because no new file slices
are generated.
In general, it does not seem right to rely on `prevCommit` alone to decide
whether the partition needs cleaning.
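
To make the two cases concrete, here is a minimal sketch. It assumes a hypothetical `Stat` class and a hypothetical `createsNewFileSlice` flag; neither is Hudi API, and how such a flag would be derived from the real write handles is left open. `prevCommitCheck` mirrors the predicate in the diff above.

```java
import java.util.Arrays;
import java.util.List;

public class PrevCommitSketch {

  // Hypothetical stand-in for a write stat; this is NOT Hudi's HoodieWriteStat.
  static final class Stat {
    final String prevCommit;           // may be null or the literal string "null"
    final boolean createsNewFileSlice; // hypothetical flag, assumed for illustration

    Stat(String prevCommit, boolean createsNewFileSlice) {
      this.prevCommit = prevCommit;
      this.createsNewFileSlice = createsNewFileSlice;
    }
  }

  // Same logic as the predicate in the diff: any stat with a non-null prevCommit.
  static boolean prevCommitCheck(List<Stat> stats) {
    return stats.stream()
        .anyMatch(s -> s.prevCommit != null && !s.prevCommit.equalsIgnoreCase("null"));
  }

  // Hypothetical alternative: any stat that produced a new file slice.
  static boolean newFileSliceCheck(List<Stat> stats) {
    return stats.stream().anyMatch(s -> s.createsNewFileSlice);
  }

  public static void main(String[] args) {
    // Case 1: compaction over a log-only file slice.
    // prevCommit is null, but a new file slice is generated.
    List<Stat> logOnlyCompaction = Arrays.asList(new Stat(null, true));

    // Case 2: MOR log update to an existing file group.
    // prevCommit is non-null, but no new file slice is generated.
    List<Stat> morLogUpdate = Arrays.asList(new Stat("001", false));

    System.out.println("case 1: prevCommit=" + prevCommitCheck(logOnlyCompaction)
        + " newFileSlice=" + newFileSliceCheck(logOnlyCompaction));
    System.out.println("case 2: prevCommit=" + prevCommitCheck(morLogUpdate)
        + " newFileSlice=" + newFileSliceCheck(morLogUpdate));
  }
}
```

In both cases the `prevCommit` predicate gives the opposite answer from the one the cases above call for: it skips the compacted partition in case 1 and selects the log-only partition in case 2.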