rdblue commented on pull request #1739: URL: https://github.com/apache/iceberg/pull/1739#issuecomment-724354430
I'm not sure that we want to do this. The issue in #1511 was that the same delete file needed to be applied to multiple data files in a task, which caused two identical entries in a map. That case had two valid references to the same file, not duplicate files.

Iceberg doesn't second-guess changes made through its API. If you add the same file twice, then you get duplicate data rows. It isn't feasible for Iceberg to deduplicate data files when appending, and I'm not sure it is a good idea to do this when reading. Even with a flag to turn on deduplication, you'd need to instruct all readers to deduplicate files.

I would rather fix the process writing duplicates to Iceberg. You should be able to check whether a file or batch of files is already committed. That's what the Flink sink does when it recovers from a checkpoint -- it checks whether a given checkpoint's files were added in a commit using snapshot metadata. Can you use the same approach to avoid duplicates?
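The recovery check described above can be sketched as follows. This is a simplified illustration, not the actual Flink sink code: snapshots are modeled as plain summary maps, and the summary key and helper names are assumptions chosen for the example (the real sink records a max-committed-checkpoint-id property in the snapshot summary of each commit).

```java
import java.util.*;

// Sketch of the dedup-on-recovery idea from the comment: each commit
// records the checkpoint id it covers in the snapshot summary, and on
// restore the writer skips any checkpoint at or below the maximum
// committed id. Names here are illustrative, not the real sink's API.
public class CommitDedupSketch {
    static final String CHECKPOINT_KEY = "flink.max-committed-checkpoint-id";

    // Snapshots modeled as summary maps, ordered oldest to newest.
    static long maxCommittedCheckpointId(List<Map<String, String>> snapshots) {
        // Walk newest to oldest; the most recent snapshot with the key wins.
        for (int i = snapshots.size() - 1; i >= 0; i--) {
            String value = snapshots.get(i).get(CHECKPOINT_KEY);
            if (value != null) {
                return Long.parseLong(value);
            }
        }
        return -1L; // no commits from this sink yet
    }

    static boolean shouldCommit(long checkpointId, List<Map<String, String>> snapshots) {
        // Only commit checkpoints newer than what is already in the table,
        // so replayed checkpoints never append their files twice.
        return checkpointId > maxCommittedCheckpointId(snapshots);
    }

    public static void main(String[] args) {
        List<Map<String, String>> snapshots = new ArrayList<>();
        snapshots.add(Map.of(CHECKPOINT_KEY, "3"));
        snapshots.add(Map.of(CHECKPOINT_KEY, "5"));

        System.out.println(shouldCommit(5, snapshots)); // already committed
        System.out.println(shouldCommit(6, snapshots)); // new, safe to commit
    }
}
```

The same idempotency pattern applies to any writer: stamp each batch with a monotonic id at commit time, then consult the table's snapshot metadata before re-committing after a failure.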
