Micah, thanks a lot for the pointer. I missed it in the scan planning
section; the language there is pretty clear.

I guess the behavior is also undefined for file deletion. The Java
implementation has two file deletion APIs: delete by path or delete by
DataFile object. When duplicate entries exist, a delete may remove all
references or just one of them.
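
To make the ambiguity concrete, here is a toy sketch (my own model, not the
actual Iceberg API; a string list stands in for the table's manifest entries,
and the two methods represent the two plausible delete semantics):

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteSemantics {
    // Metadata with the same file path recorded twice (the bug case).
    static List<String> duplicatedEntries() {
        return new ArrayList<>(List.of("a.parquet", "b.parquet", "a.parquet"));
    }

    // Semantics 1: delete by path removes every matching reference.
    static List<String> deleteByPath(List<String> entries, String path) {
        entries.removeIf(path::equals);
        return entries;
    }

    // Semantics 2: delete by a single DataFile object removes only one
    // reference, leaving a dangling duplicate behind.
    static List<String> deleteOneReference(List<String> entries, String path) {
        entries.remove(path); // removes only the first occurrence
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(deleteByPath(duplicatedEntries(), "a.parquet"));
        // [b.parquet]
        System.out.println(deleteOneReference(duplicatedEntries(), "a.parquet"));
        // [b.parquet, a.parquet]
    }
}
```

Either outcome is defensible, which is exactly why it would help for the spec
to say something about this case.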

On Thu, Dec 18, 2025 at 2:38 PM Micah Kornfield <[email protected]>
wrote:

> I think the spec already does this [1]:
>
> "Note that for any snapshot, all file paths marked with "ADDED" or
> "EXISTING" may appear at most once across all manifest files in the
> snapshot. If a file path appears more than once, the results of the scan
> are undefined. Reader implementations may raise an error in this case, but
> are not required to do so."
>
> But maybe we should make the language clearer?
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L852
>
>
>
> On Thu, Dec 18, 2025 at 2:05 PM Steven Wu <[email protected]> wrote:
>
>>
>> The community has reported duplicate entries for the same data file in an
>> Iceberg table. This is likely due to implementation bugs or invalid
>> operations (like adding a file reference directly to the table multiple
>> times). There are probably no valid reasons for having duplicate file
>> entries in Iceberg metadata.
>>
>> There have been efforts to repair manifest files by de-duplicating data
>> file entries or removing entries that point to non-existent files. The
>> latest attempt is from Drew:
>> https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v
>>
>> Previously, duplicate entries just resulted in duplicate rows. With row
>> lineage in V3, this can get more complicated. If the data file doesn't
>> have persisted row-id values, the behavior is similar to before: every
>> row still gets a unique row-id even though the data file entries are
>> duplicated. But if the row-id values are persisted in the data file
>> (e.g., after a compaction rewrite), duplicate entries can produce two
>> active rows with the same row-id, which breaks the row-id semantics.
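
The two row-id cases can be sketched with a toy model (hypothetical, not the
actual V3 row-lineage implementation; `scanEntry` stands in for reading one
manifest entry of a three-row data file):

```java
import java.util.ArrayList;
import java.util.List;

public class RowIdDuplication {
    // Assign row ids for one manifest entry of a 3-row file. If persistedIds
    // is null, ids are derived from the entry's assigned first-row-id (the
    // inheritance case); otherwise the persisted ids are read back as-is
    // (the compaction-rewrite case).
    static List<Long> scanEntry(long firstRowId, long[] persistedIds) {
        List<Long> ids = new ArrayList<>();
        for (int pos = 0; pos < 3; pos++) {
            ids.add(persistedIds == null ? firstRowId + pos : persistedIds[pos]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Duplicate entries, no persisted ids: each entry is assigned its own
        // first-row-id, so rows stay unique even though the file is doubled.
        List<Long> inherited = new ArrayList<>(scanEntry(0, null));
        inherited.addAll(scanEntry(3, null));
        System.out.println(inherited); // [0, 1, 2, 3, 4, 5]

        // Duplicate entries with persisted ids: both entries replay the same
        // ids, yielding two active rows per row id.
        long[] persisted = {10, 11, 12};
        List<Long> replayed = new ArrayList<>(scanEntry(0, persisted));
        replayed.addAll(scanEntry(0, persisted));
        System.out.println(replayed); // [10, 11, 12, 10, 11, 12]
    }
}
```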
>>
>> Duplicate entries can also lead to odd behavior when position deletes or
>> deletion vectors (DVs) are matched against data files by data sequence
>> number.
>>
>> Should the spec call out that tables with duplicate file entries are
>> considered to be in an "incorrect" state?
>>
>> Thanks,
>> Steven
>>
>
