rdblue commented on pull request #1739: URL: https://github.com/apache/iceberg/pull/1739#issuecomment-724354430
I'm not sure that we want to do this. The issue in #1511 was that the same delete file needed to be applied to multiple data files in a task, which caused two identical entries in a map. That case had two valid references to the same file, not duplicate files.

Iceberg doesn't second-guess changes made through its API. If you add the same file twice, then you get duplicate data rows. It isn't feasible for Iceberg to deduplicate data files when appending, and I'm not sure it is a good idea to do this when reading. Even with a flag to turn on deduplication, you'd need to instruct all readers to deduplicate files.

I would rather fix the process writing duplicates to Iceberg. You should be able to check whether a file or batch of files is already committed. That's what the Flink sink does when it recovers from a checkpoint -- it checks whether a given checkpoint's files were added in a commit using snapshot metadata. Can you use the same approach to avoid duplicates?
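The recovery check described above can be sketched as follows. This is a simplified illustration, not the actual Flink sink code: snapshots are modeled as plain summary maps, and the summary key and helper names are assumptions chosen for the example (the real sink records a max-committed-checkpoint-id property in the snapshot summary of each commit).

```java
import java.util.*;

// Sketch of the dedup-on-recovery idea from the comment: each commit
// records the checkpoint id it covers in the snapshot summary, and on
// restore the writer skips any checkpoint at or below the maximum
// committed id. Names here are illustrative, not the real sink's API.
public class CommitDedupSketch {
    static final String CHECKPOINT_KEY = "flink.max-committed-checkpoint-id";

    // Snapshots modeled as summary maps, ordered oldest to newest.
    static long maxCommittedCheckpointId(List<Map<String, String>> snapshots) {
        // Walk newest to oldest; the most recent snapshot with the key wins.
        for (int i = snapshots.size() - 1; i >= 0; i--) {
            String value = snapshots.get(i).get(CHECKPOINT_KEY);
            if (value != null) {
                return Long.parseLong(value);
            }
        }
        return -1L; // no commits from this sink yet
    }

    static boolean shouldCommit(long checkpointId, List<Map<String, String>> snapshots) {
        // Only commit checkpoints newer than what is already in the table,
        // so replayed checkpoints never append their files twice.
        return checkpointId > maxCommittedCheckpointId(snapshots);
    }

    public static void main(String[] args) {
        List<Map<String, String>> snapshots = new ArrayList<>();
        snapshots.add(Map.of(CHECKPOINT_KEY, "3"));
        snapshots.add(Map.of(CHECKPOINT_KEY, "5"));

        System.out.println(shouldCommit(5, snapshots)); // already committed
        System.out.println(shouldCommit(6, snapshots)); // new, safe to commit
    }
}
```

The same idempotency pattern applies to any writer: stamp each batch with a monotonic id at commit time, then consult the table's snapshot metadata before re-committing after a failure.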
