The community has reported duplicate entries of the same data file in an Iceberg table. This is likely due to implementation bugs or invalid operations (like adding a file reference directly to the table multiple times). There is probably no valid reason to have duplicate file entries in Iceberg metadata.
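For illustration, here is a minimal, untested sketch of the kind of invalid operation I mean, using the core Java API (the file path and stats are made up, and it assumes an unpartitioned spec for brevity). As far as I can tell, nothing in the commit path checks whether the same file path is already referenced by a live entry, so both appends go through and the second one creates the duplicate entry:

  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.DataFiles;
  import org.apache.iceberg.FileFormat;
  import org.apache.iceberg.Table;

  // Register the same physical file twice. Nothing here validates that the
  // path is not already referenced, so the table ends up with two live
  // manifest entries pointing at one data file.
  static void appendTwice(Table table) {
    DataFile file = DataFiles.builder(table.spec())
        .withPath("s3://bucket/db/tbl/data/part-00000.parquet")  // made-up path
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(1024L)   // made-up stats
        .withRecordCount(100L)
        .build();

    table.newAppend().appendFile(file).commit();  // first reference
    table.newAppend().appendFile(file).commit();  // second reference -> duplicate entry
  }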
There have been efforts to repair manifest files, either to de-dup data file entries or to remove entries that point to non-existent files. The latest attempt is from Drew: https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v

Before row lineage, duplicate file entries just resulted in duplicate rows. With row lineage in V3, this can get more complicated. If the data file does not have persisted row-id values, the behavior is similar to before: every row still gets a unique row id, even though the data file entries are duplicated. But if the row-id values are persisted in the data file (e.g., after a compaction rewrite), duplicate entries can produce two active rows with the same row id, which breaks the row-id semantics. It can also lead to odd behavior when position deletes or DVs are matched against data files by data sequence number.

Should the spec call out that tables with duplicate file entries are considered to be in an "incorrect" state?

Thanks,
Steven
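P.S. In case it helps the discussion, here is a rough, untested sketch (against the core Java API) of how one could check whether a table is already in this state. It assumes planFiles() surfaces one FileScanTask per live manifest entry of the current snapshot; the helper name is mine, and delete files are not covered:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.io.CloseableIterable;

  // Returns data file paths that appear in more than one live manifest entry
  // of the current snapshot, along with how many times each appears.
  static Map<String, Long> duplicateDataFiles(Table table) throws IOException {
    Map<String, Long> counts = new HashMap<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        counts.merge(task.file().path().toString(), 1L, Long::sum);
      }
    }
    counts.values().removeIf(n -> n < 2);  // keep only paths seen more than once
    return counts;
  }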
