rdblue commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2688017581
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -87,6 +96,7 @@ abstract class MergingSnapshotProducer<ThisT> extends SnapshotProducer<ThisT> {
   private final Map<Integer, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
   private Long newDataFilesDataSequenceNumber;
   private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
+  private final Map<String, DeleteFileSet> duplicateDVsForDataFile = Maps.newHashMap();
Review Comment:
I agree with Russell here. I think it would be better to keep track of the DVs in a different way.

Now that there may be duplicate DVs that get rewritten, we can no longer trust `newDeleteFilesBySpec` because it could contain DVs that will be replaced with compacted ones. I think what I would do is a bit simpler: I would keep a multimap of `DeleteFile` by referenced data file (location). Then in `newDeleteFilesAsManifests`, you would loop through each referenced data file and process its `DeleteFile` entries. If there is more than one, you'd create a new DV, write it out to a new file, and then add the new DV's metadata to a manifest writer (by spec ID). If it's important to dedup before compacting, you could keep a single `DeleteFileSet` and clear it before processing each referenced file's deletes. See the sketch below.
I think that's slightly more reliable in a couple of ways. For instance, it would catch DVs with different (but possibly equivalent) specs, and it would catch cases where you have both a v2 delete file and a DV, which should also be compacted. It's also simpler to keep state in just one place.