amogh-jahagirdar commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2687748776


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -87,6 +96,7 @@ abstract class MergingSnapshotProducer<ThisT> extends SnapshotProducer<ThisT> {
   private final Map<Integer, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
   private Long newDataFilesDataSequenceNumber;
   private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
+  private final Map<String, DeleteFileSet> duplicateDVsForDataFile = Maps.newHashMap();

Review Comment:
   > Why would it be 2 * Deletefiles? We would just use the value set as the list of all delete files?
   
   Yeah, I just meant that it'd be 2 * delete files if we had both deleteFilesBySpec and dvRefs as maps. We could split the tracking so that deleteFilesBySpec only tracks equality deletes and v2 position deletes, and dvRefs holds only the DV refs. Then it's just O(deletes), but separating the tracking also means changes in other places.
   
   > I guess my thought here, is if we are going to spend the time to clean it up we may as well just do String, DeleteFiles map. If we were just going to throw an error I would keep newDVRefs as a set.
   
   I'm good either way, with a slight preference for tracking just the duplicates!
   
   I do think we should do a fixup rather than throw an error, because even when the fixup is expensive, it incentivizes engines to avoid the problem in their distributed writes, or to do their own fixup before handing off to the commit path. That way we get the best of both, protecting table state and perf, without causing failures.
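   To make the "track just the duplicates" option above concrete, here is a minimal standalone sketch of that bookkeeping shape. It uses plain `java.util` collections and `String` paths as hypothetical stand-ins for Iceberg's `DeleteFileSet` and `DeleteFile` types, and is not the actual `MergingSnapshotProducer` implementation: the first DV seen per data file is kept in one map, and only extra DVs land in a `duplicateDvsForDataFile` map, so the fixup path has exactly the duplicates to work with rather than failing the commit.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of duplicate-DV tracking, keyed by data file path.
// Names and types are illustrative, not Iceberg's actual API.
public class DuplicateDvTracker {
  // first DV recorded for each data file
  private final Map<String, String> dvByDataFile = new HashMap<>();
  // only the extra (duplicate) DVs per data file, for a later fixup pass
  private final Map<String, Set<String>> duplicateDvsForDataFile = new HashMap<>();

  // Record a DV for a data file; a second, different DV for the same
  // data file is stashed as a duplicate instead of raising an error,
  // mirroring the "fixup rather than error" approach discussed above.
  public void add(String dataFilePath, String dvPath) {
    String existing = dvByDataFile.putIfAbsent(dataFilePath, dvPath);
    if (existing != null && !existing.equals(dvPath)) {
      duplicateDvsForDataFile
          .computeIfAbsent(dataFilePath, k -> new HashSet<>())
          .add(dvPath);
    }
  }

  public boolean hasDuplicates() {
    return !duplicateDvsForDataFile.isEmpty();
  }

  public Set<String> duplicatesFor(String dataFilePath) {
    return duplicateDvsForDataFile.getOrDefault(dataFilePath, Set.of());
  }
}
```

   The memory cost stays O(duplicates) rather than O(all deletes), which matches the preference expressed above for tracking just the duplicates.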



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
