amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2729978590
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -265,15 +268,14 @@ private void addInternal(DeleteFile file) {
"Cannot find partition spec %s for delete file: %s",
file.specId(),
file.location());
-
- DeleteFileSet deleteFiles =
- newDeleteFilesBySpec.computeIfAbsent(spec.specId(), ignored ->
DeleteFileSet.create());
- if (deleteFiles.add(file)) {
- addedFilesSummary.addedFile(spec, file);
- hasNewDeleteFiles = true;
- if (ContentFileUtil.isDV(file)) {
- newDVRefs.add(file.referencedDataFile());
- }
+ hasNewDeleteFiles = true;
Review Comment:
Since we're not tracking by DeleteFileSet at the time of adding, we treat
every addition as a new delete, even potential duplicates (unless we want to do
a look back in the list on every addDeleteFile, but I'm very against that since
it's an O(deletes-added^2) operation effectively at that point for a commit).
If we look at how `hasNewDeleteFiles` is actually used, I don't think this
is really consequential. hasNewDeleteFiles is true and there's a cached state
we use the flag as an indication that the cache is stale, and should be cleared
out/files cleaned up. Even if there are duplicates, there's at least 1 file
which is new.
We end up merging/deduping the DVs (and the V2 pos deletes and equality
deletes) anyways just before producing new manifests. See my comment below
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]