Micah, thanks a lot for the pointer. I missed that part of the scan planning section; the language there is pretty clear.
I guess the behavior is also undefined for file deletion. The Java implementation has two file-deletion APIs: delete by path or delete by DataFile object. With duplicate entries, it is unclear whether a delete removes all references or just one.

On Thu, Dec 18, 2025 at 2:38 PM Micah Kornfield <[email protected]> wrote:

> I think the spec already does this [1]:
>
> "Note that for any snapshot, all file paths marked with "ADDED" or
> "EXISTING" may appear at most once across all manifest files in the
> snapshot. If a file path appears more than once, the results of the scan
> are undefined. Reader implementations may raise an error in this case, but
> are not required to do so."
>
> But maybe we should make the language clearer?
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L852
>
> On Thu, Dec 18, 2025 at 2:05 PM Steven Wu <[email protected]> wrote:
>
>> The community has reported duplicate entries of the same data file in an
>> Iceberg table. This is likely due to implementation bugs or invalid
>> operations (such as adding the same file reference to the table multiple
>> times). There are probably no valid reasons for having duplicate files in
>> Iceberg metadata.
>>
>> There have been efforts to repair manifest files by de-duplicating data
>> file entries or removing entries that point to non-existent files. The
>> latest attempt is from Drew:
>> https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v
>>
>> Previously, duplicate entries just resulted in duplicate rows. With row
>> lineage in V3, this can get more complicated. If the data file doesn't
>> have persisted row-id values, the behavior is the same as before: every
>> row gets a unique row id even though the data file entries are
>> duplicated. But if row-id values are persisted in the data file (e.g.,
>> after a compaction rewrite), duplicate entries can produce two active
>> rows with the same row id. That would break the row-id semantics.
>> It can also lead to weird behavior when position deletes or DVs are
>> matched against data files by data sequence number.
>>
>> Should the spec call out that tables with duplicate files are considered
>> to be in an "incorrect" state?
>>
>> Thanks,
>> Steven
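To make the row-id concern concrete, here is a small, self-contained simulation of the failure mode (this is hypothetical illustration code, not Iceberg's actual manifest or scan implementation; the class and method names are invented). It models a snapshot's manifest entries as a list of file paths, with each file carrying persisted row-id values, and shows that a duplicated entry yields two "active" rows per row id:

```java
import java.util.*;

// Hypothetical demo: duplicate manifest entries + persisted row ids
// break row-id uniqueness. Not real Iceberg code.
public class DuplicateEntryDemo {

    // Simulates a scan: for every manifest entry, emit the row ids
    // persisted in the referenced data file.
    static List<Long> scanRowIds(List<String> manifestEntries,
                                 Map<String, long[]> persistedRowIds) {
        List<Long> rowIds = new ArrayList<>();
        for (String path : manifestEntries) {
            for (long id : persistedRowIds.get(path)) {
                rowIds.add(id);
            }
        }
        return rowIds;
    }

    public static void main(String[] args) {
        // File rewritten by compaction, so row ids 0 and 1 are persisted.
        Map<String, long[]> persisted =
            Map.of("data-00.parquet", new long[] {0L, 1L});

        // Bug: the same file path appears twice across the snapshot's
        // manifests (the case the spec declares undefined).
        List<String> entries =
            List.of("data-00.parquet", "data-00.parquet");

        List<Long> ids = scanRowIds(entries, persisted);
        Set<Long> unique = new HashSet<>(ids);
        System.out.println(ids.size() + " rows scanned, "
            + unique.size() + " unique row ids");
        // 4 rows scanned, but only 2 unique row ids: each row id now
        // identifies two active rows, violating row lineage.
    }
}
```

A repair pass of the kind mentioned above would amount to de-duplicating the entry list by path (e.g., collapsing it through a LinkedHashSet) before planning the scan.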
