> Currently, readers are not required to raise an error. But with V3 row
> lineage, it can have correctness implications for row_id uniqueness.
> Should the language be stronger for V3 tables?

My concern with making this a requirement is that it puts a burden on
readers to always do some aggregation over all scanned files; for very
large tables this can have non-trivial overhead. We should maybe put the
text someplace more central, so that it is very clear this should never be
violated, but I think the onus is really on the writers here to not
produce invalid tables.
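To make the overhead concrete, a reader-side guard would have to look
roughly like this (a sketch against the Java scan API; the method name is
just illustrative):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;

    // Detecting a duplicate means remembering every planned data file
    // path, i.e. O(number of files) memory held for the entire scan.
    static void failOnDuplicateDataFiles(Table table) throws IOException {
      Set<String> seen = new HashSet<>();
      try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
        for (FileScanTask task : tasks) {
          String path = task.file().path().toString();
          if (!seen.add(path)) {  // add() returns false on a repeat
            throw new IllegalStateException("duplicate data file entry: " + path);
          }
        }
      }
    }

Nothing hard about it, but the set has to span every manifest entry in the
snapshot, which is why I would rather keep it optional for readers.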
Cheers,
Micah

On Thu, Dec 18, 2025 at 3:13 PM Steven Wu <[email protected]> wrote:

> Currently, readers are not required to raise an error. But with V3 row
> lineage, it can have correctness implications for row_id uniqueness.
> Should the language be stronger for V3 tables?
>
> On Thu, Dec 18, 2025 at 2:48 PM Steven Wu <[email protected]> wrote:
>
>> Micah, thanks a lot for the pointer. I missed it in the scan planning
>> section. The language there is pretty clear for scan planning.
>>
>> I guess the behavior is also undefined for file deletion. The Java
>> implementation has two file deletion APIs: delete by path or delete by
>> DataFile object. It may delete all references or just one reference.
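>>
>> For reference, the two overloads (DeleteFiles via Table.newDelete())
>> look roughly like this (a sketch; `table` and `dataFile` are assumed
>> to be in scope):
>>
>>     // delete by path
>>     table.newDelete()
>>         .deleteFile("s3://bucket/db/tbl/data/file-a.parquet")
>>         .commit();
>>
>>     // delete by DataFile object
>>     table.newDelete()
>>         .deleteFile(dataFile)
>>         .commit();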
>>
>> On Thu, Dec 18, 2025 at 2:38 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> I think the spec already does this [1]:
>>>
>>> "Note that for any snapshot, all file paths marked with "ADDED" or
>>> "EXISTING" may appear at most once across all manifest files in the
>>> snapshot. If a file path appears more than once, the results of the
>>> scan are undefined. Reader implementations may raise an error in this
>>> case, but are not required to do so."
>>>
>>> But maybe we should make the language clearer?
>>>
>>> Cheers,
>>> Micah
>>>
>>> [1]
>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L852
>>>
>>> On Thu, Dec 18, 2025 at 2:05 PM Steven Wu <[email protected]> wrote:
>>>
>>>> The community has reported duplicate entries of the same data file in
>>>> an Iceberg table. This is likely due to implementation bugs or invalid
>>>> operations (like adding a file reference directly to the table
>>>> multiple times). There is probably no valid reason for having
>>>> duplicate files in Iceberg metadata.
>>>>
>>>> There have been efforts to repair manifest files by de-duplicating
>>>> data file entries or removing entries that point to non-existent
>>>> files. The latest attempt is from Drew:
>>>> https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v
>>>>
>>>> Previously, duplicate entries just resulted in duplicate rows. With
>>>> row lineage in V3, this can get more complicated. If the data file
>>>> does not have persisted values for row-id, it behaves the same as
>>>> before: every row still gets a unique row-id even though the data
>>>> file entries are duplicated. But if the row-id values are persisted
>>>> in the data file (as after a compaction rewrite), duplicate entries
>>>> can produce two active rows with the same row-id, which breaks the
>>>> row-id semantics.
>>>>
>>>> It can also lead to odd behavior when position deletes or DVs are
>>>> matched against data files by data sequence number.
>>>>
>>>> Should the spec call out that tables with duplicate files are
>>>> considered to be in an "incorrect" state?
>>>>
>>>> Thanks,
>>>> Steven
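>>>>
>>>> P.S. For concreteness, the kind of invalid double-add mentioned above
>>>> would look roughly like this (a sketch; `table` and `dataFile` are
>>>> assumed to be in scope):
>>>>
>>>>     table.newAppend()
>>>>         .appendFile(dataFile)
>>>>         .commit();
>>>>
>>>>     // appending the same DataFile again is not rejected, leaving
>>>>     // two manifest entries that reference one physical file
>>>>     table.newAppend()
>>>>         .appendFile(dataFile)
>>>>         .commit();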