>
> Currently, readers are not required to raise an error. But with V3 row
> lineage, it can have correctness implications for row_id uniqueness. Should
> the language be stronger for V3 tables?


My concern with making this a requirement is that it puts a burden on
readers to always perform some aggregation over all scanned files; for
very large tables this can have non-trivial overhead.
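
For concreteness, a minimal sketch (hypothetical code, not an actual
Iceberg API) of the aggregation a reader would have to perform: collecting
every ADDED/EXISTING file path across all manifests of a snapshot to
detect repeats. Memory grows with the number of live files, which is the
overhead in question.

```python
# Hypothetical reader-side duplicate check; the (status, file_path) entry
# shape is a simplification, not Iceberg's actual manifest entry model.
def find_duplicate_paths(manifest_entries):
    """Return file paths appearing more than once among ADDED/EXISTING
    entries across all manifests of a snapshot."""
    seen = set()
    duplicates = set()
    for status, path in manifest_entries:
        if status not in ("ADDED", "EXISTING"):
            continue  # DELETED entries are not covered by the uniqueness rule
        if path in seen:
            duplicates.add(path)
        seen.add(path)
    return duplicates
```

The `seen` set must hold every live file path in the snapshot, which is
the non-trivial cost for very large tables.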

We should maybe put the text someplace more central, so that it is very
clear this should never be violated, but I think the onus is really on the
writers here to not produce invalid tables.

Cheers,
Micah

On Thu, Dec 18, 2025 at 3:13 PM Steven Wu <[email protected]> wrote:

> Currently, readers are not required to raise an error. But with V3 row
> lineage, it can have correctness implications for row_id uniqueness. Should
> the language be stronger for V3 tables?
>
> On Thu, Dec 18, 2025 at 2:48 PM Steven Wu <[email protected]> wrote:
>
>> Micah, thanks a lot for the pointer. I missed it in the scan planning
>> section. The language is pretty clear for scan planning.
>>
>> I guess the behavior is also undefined for file deletion. The Java
>> implementation has two file deletion APIs: delete by path or delete by
>> DataFile object. With duplicate entries, it is unclear whether a delete
>> removes all references to the file or just one.
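
To illustrate the divergence with a toy model (hypothetical names, not the
Java API's actual semantics): when a table holds duplicate references to
the same path, the two deletion styles leave different states behind.

```python
# Toy model of two deletion behaviors over a list of file-path entries
# that may contain duplicates (names are hypothetical).
def delete_all_references(entries, path):
    """Delete-by-name style: drop every entry matching the path."""
    return [e for e in entries if e != path]

def delete_one_reference(entries, path):
    """Delete-by-object style: drop only the first matching entry."""
    out = list(entries)
    if path in out:
        out.remove(path)  # list.remove drops only the first occurrence
    return out
```

With entries `["a", "b", "a"]`, the first returns `["b"]` while the second
returns `["b", "a"]`, leaving a stale duplicate behind.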
>>
>> On Thu, Dec 18, 2025 at 2:38 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> I think the spec already does this [1]:
>>>
>>> "Note that for any snapshot, all file paths marked with "ADDED" or
>>> "EXISTING" may appear at most once across all manifest files in the
>>> snapshot. If a file path appears more than once, the results of the scan
>>> are undefined. Reader implementations may raise an error in this case, but
>>> are not required to do so."
>>>
>>> But maybe we should make the language clearer?
>>>
>>> Cheers,
>>> Micah
>>>
>>> [1]
>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L852
>>>
>>>
>>>
>>> On Thu, Dec 18, 2025 at 2:05 PM Steven Wu <[email protected]> wrote:
>>>
>>>>
>>>> The community has reported duplicate entries of the same data file in
>>>> an Iceberg table. This is likely due to implementation bugs or invalid
>>>> operations (like adding a file reference to the table multiple times).
>>>> There is probably no valid reason for having duplicate files in
>>>> Iceberg metadata.
>>>>
>>>> There have been efforts to repair manifest files by de-duplicating
>>>> data file entries or removing entries that point to non-existent
>>>> files. The latest attempt is from Drew:
>>>> https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v
>>>>
>>>> Previously, duplicates just resulted in duplicate rows. With row
>>>> lineage in V3, this gets more complicated. If the data file does not
>>>> have persisted row-id values, the behavior is similar to before: every
>>>> row still gets a unique row id even though the data file entries are
>>>> duplicated. But if row-id values are persisted in the data file (e.g.
>>>> after a compaction rewrite), duplicate entries can produce two active
>>>> rows with the same row id, which breaks row-id semantics.
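
A toy model of V3 row-id assignment (field names simplified; not the
spec's exact algorithm) makes the failure mode visible: duplicated
entries without persisted ids still get unique row ids from the
snapshot's counter, while duplicated entries with persisted ids repeat
the same ids.

```python
# Simplified model: each entry is (first_row_id, record_count); a row's id
# is first_row_id + position. first_row_id=None means ids are not persisted
# and are allocated from a running counter, mimicking commit-time assignment.
def assign_row_ids(file_entries, next_row_id=0):
    rows = []
    for first_row_id, record_count in file_entries:
        if first_row_id is None:  # not persisted: allocate fresh ids
            first_row_id = next_row_id
            next_row_id += record_count
        rows.extend(range(first_row_id, first_row_id + record_count))
    return rows
```

Duplicating an entry without persisted ids yields disjoint id ranges, but
duplicating an entry whose `first_row_id` is persisted (as after a
compaction rewrite) emits the same row ids twice.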
>>>>
>>>> It can also lead to odd behavior when position deletes or deletion
>>>> vectors (DVs) are matched to data files by data sequence number.
>>>>
>>>> Should the spec call out that tables with duplicate file entries are
>>>> considered to be in an invalid state?
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>
