The community has reported duplicate entries for the same data file in an
Iceberg table. This is likely due to implementation bugs or invalid
operations (such as adding the same file reference directly to the table
multiple times). There is probably no valid reason for having duplicate
files in Iceberg metadata.
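
For illustration, here is a minimal sketch of what "duplicate entries" means
in practice. It assumes the live manifest entries have already been read into
plain dicts with a "file_path" key; the function name and the sample paths
are hypothetical and this is not an Iceberg API.

```python
from collections import Counter


def find_duplicate_files(entries):
    """Return {file_path: count} for paths referenced by more than one entry.

    `entries` is assumed to be an iterable of dicts, one per live data file
    entry across all manifests of a snapshot (hypothetical shape).
    """
    counts = Counter(entry["file_path"] for entry in entries)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical sample: a.parquet is referenced twice.
entries = [
    {"file_path": "s3://bucket/tbl/data/a.parquet"},
    {"file_path": "s3://bucket/tbl/data/b.parquet"},
    {"file_path": "s3://bucket/tbl/data/a.parquet"},
]
duplicates = find_duplicate_files(entries)
print(duplicates)
```

A table in a valid state would produce an empty result here.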

There have been efforts to repair manifest files, either to de-dup the data
file entries or to remove entries for non-existent files. The latest attempt
is from Drew:
https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v

Previously, this just resulted in duplicate rows. With row lineage in V3,
this can get more complicated. If the data file doesn't have persisted
values for row-id, this behaves much as before: every row still gets a
unique row-id even though the data file entries are duplicated. But if the
row-id values are persisted in the data file (for example, after a
compaction rewrite), duplicate files can produce two active rows with the
same row-id. That would break the row-id semantics.
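
A rough sketch of the two cases, as a simplified model rather than the actual
reader logic. It assumes that a row without a persisted row-id gets one
derived from its entry's first row id plus the row position, and that each
duplicated entry was assigned its own first row id. The Entry class and its
field names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Simplified stand-in for a data file entry (hypothetical shape)."""
    file_path: str
    first_row_id: int
    record_count: int
    # Row ids stored in the data file itself, e.g. after a compaction rewrite.
    persisted_row_ids: Optional[List[int]] = None


def effective_row_ids(entry):
    """Row ids a reader would see for this entry under the assumptions above."""
    if entry.persisted_row_ids is not None:
        return list(entry.persisted_row_ids)
    return [entry.first_row_id + pos for pos in range(entry.record_count)]


def colliding_row_ids(entries):
    """Row ids that appear on more than one active row."""
    counts = Counter(rid for e in entries for rid in effective_row_ids(e))
    return sorted(rid for rid, n in counts.items() if n > 1)


# Case 1: no persisted row ids. The duplicate entries have different
# first row ids, so every row still ends up with a unique row id.
inherited = [Entry("a.parquet", 0, 3), Entry("a.parquet", 3, 3)]

# Case 2: row ids persisted in the file. Both entries read the same values.
persisted = [
    Entry("b.parquet", 100, 3, [7, 8, 9]),
    Entry("b.parquet", 103, 3, [7, 8, 9]),
]

print(colliding_row_ids(inherited))
print(colliding_row_ids(persisted))
```

In this model only the second case reports collisions, which is the
situation that breaks the uniqueness of row ids.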

It can also lead to unexpected behavior when position deletes or DVs are
matched against data files using data sequence numbers.

Should the spec call out that tables with duplicate files are
considered to be in an "incorrect" state?

Thanks,
Steven
