Re: [DISCUSS] duplicate files

Micah Kornfield Thu, 18 Dec 2025 14:38:28 -0800

I think the spec already does this [1]:

"Note that for any snapshot, all file paths marked with "ADDED" or
"EXISTING" may appear at most once across all manifest files in the
snapshot. If a file path appears more than once, the results of the scan
are undefined. Reader implementations may raise an error in this case, but
are not required to do so."


But maybe we should make the language clearer?
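
To make the quoted rule concrete, here is a rough sketch of the check it implies. It is an illustration only, not Iceberg's API: `ManifestEntry` and `find_duplicate_paths` are hypothetical names, and the entries are assumed to have already been read from every manifest in a single snapshot.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Status(Enum):
    # Manifest entry status codes as listed in the Iceberg spec.
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class ManifestEntry:
    # Hypothetical, simplified stand-in for a manifest entry.
    status: Status
    file_path: str


def find_duplicate_paths(entries: Iterable[ManifestEntry]) -> dict[str, int]:
    """Return paths that appear more than once as ADDED or EXISTING."""
    live = (e.file_path for e in entries if e.status is not Status.DELETED)
    counts = Counter(live)
    return {path: n for path, n in counts.items() if n > 1}


# Made-up entries, as if collected from all manifests of one snapshot.
entries = [
    ManifestEntry(Status.ADDED, "s3://bucket/data/a.parquet"),
    ManifestEntry(Status.EXISTING, "s3://bucket/data/a.parquet"),
    ManifestEntry(Status.ADDED, "s3://bucket/data/b.parquet"),
    ManifestEntry(Status.DELETED, "s3://bucket/data/b.parquet"),
]
duplicates = find_duplicate_paths(entries)
print(duplicates)
```

DELETED entries are skipped because the rule only covers paths marked ADDED or EXISTING. A reader that wants to be strict could raise an error when the result is non-empty, which the spec permits but does not require.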

Cheers,
Micah

[1] https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L852



On Thu, Dec 18, 2025 at 2:05 PM Steven Wu wrote:

>
> The community has reported duplicate entries of the same data file in an
> Iceberg table. This is likely due to implementation bugs or some invalid
> operations (like adding a file reference directly to the table multiple
> times). There are probably no valid reasons for having duplicate files in
> Iceberg metadata.
>
> There have been efforts to repair manifest files by de-duplicating data file
> entries or removing entries for non-existent files. The latest attempt is
> from Drew:
> https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v
>
> Previously, this just resulted in duplicate rows. With row lineage in V3,
> this can get more complicated. If the data file doesn't have persisted
> values for row-id, this behaves the same as before: every row still gets a
> unique row-id even though the data file entries are duplicated. But if the
> row-id values are persisted in the data file (as after a compaction
> rewrite), duplicate files can cause two active rows with the same row-id.
> That would break the row-id semantics.
>
> It can also lead to odd behavior when position deletes or DVs are matched
> to data files by data sequence number.
>
> Should the spec call out that tables with duplicate files are
> considered to be in an "incorrect" state?
>
> Thanks,
> Steven
>
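
The row-id concern in the quoted message can be illustrated with a toy model. This is a simplified sketch, not Iceberg code: `effective_row_ids` and `has_collision` are hypothetical helpers, and the model assumes a missing (None) row-id is derived as the entry's first row id plus the row position, with each duplicate entry having its own first row id.

```python
from typing import Optional, Sequence


def effective_row_ids(first_row_id: int,
                      persisted: Sequence[Optional[int]]) -> list[int]:
    """Row ids a reader would see for one manifest entry (toy model).

    A persisted value wins; a missing value (None) is derived from the
    entry's first_row_id plus the row's position in the file.
    """
    return [first_row_id + pos if rid is None else rid
            for pos, rid in enumerate(persisted)]


def has_collision(*files: list[int]) -> bool:
    """True if any row id appears more than once across the given entries."""
    seen: set[int] = set()
    for ids in files:
        for rid in ids:
            if rid in seen:
                return True
            seen.add(rid)
    return False


# Case 1: the file has no persisted row ids. Two entries point at the same
# file but carry different (made-up) first_row_id values, 100 and 103.
no_ids = [None, None, None]
case1 = has_collision(effective_row_ids(100, no_ids),
                      effective_row_ids(103, no_ids))

# Case 2: row ids were persisted in the file (e.g. by a compaction rewrite),
# so both entries expose the same ids regardless of first_row_id.
persisted = [7, 8, 9]
case2 = has_collision(effective_row_ids(100, persisted),
                      effective_row_ids(103, persisted))

print(case1, case2)
```

In this model the first case yields duplicate rows with distinct row ids, while the second yields two active rows per row id, which is the situation described above.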
