> My concern with making this a requirement is that it puts a burden on
> readers to always do some aggregation of all scanned files; for very large
> tables this can have non-trivial overhead.

This is a good point.
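
To make the overhead concrete, here is a rough sketch (against the current
Java API; purely illustrative, not a proposal) of what a mandatory check
would force onto every reader at scan-planning time. The "seen" set has to
grow to one entry per live data file, which is the non-trivial part for very
large tables:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.exceptions.ValidationException;
import org.apache.iceberg.io.CloseableIterable;

class DuplicateFileCheck {
  // Aggregates every scanned data file path into one in-memory set and
  // fails if a path repeats. Costs O(number of live files) extra memory.
  static void validateNoDuplicates(Table table) throws IOException {
    Set<String> seen = new HashSet<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        String path = task.file().path().toString();
        if (!seen.add(path)) {
          throw new ValidationException("duplicate data file in snapshot: %s", path);
        }
      }
    }
  }
}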

> We should maybe put the text someplace more central, so that it is very
> clear this should never be violated, but I think the onus is really on the
> writers here to not produce invalid tables.

Totally agree. It is probably good to call out writer behavior, maybe in
the implementation notes section?



On Thu, Dec 18, 2025 at 4:19 PM Micah Kornfield <[email protected]>
wrote:

>> Currently, readers are not required to raise an error. But with V3 row
>> lineage, it can have correctness implications for row_id uniqueness. Should
>> the language be stronger for V3 tables?
>
>
> My concern with making this a requirement is that it puts a burden on
> readers to always do some aggregation of all scanned files; for very large
> tables this can have non-trivial overhead.
>
> We should maybe put the text someplace more central, so that it is very
> clear this should never be violated, but I think the onus is really on the
> writers here to not produce invalid tables.
>
> Cheers,
> Micah
>
> On Thu, Dec 18, 2025 at 3:13 PM Steven Wu <[email protected]> wrote:
>
>> Currently, readers are not required to raise an error. But with V3 row
>> lineage, it can have correctness implications for row_id uniqueness. Should
>> the language be stronger for V3 tables?
>>
>> On Thu, Dec 18, 2025 at 2:48 PM Steven Wu <[email protected]> wrote:
>>
>>> Micah, thanks a lot for the pointer. I missed it in the scan planning
>>> section; the language there is pretty clear.
>>>
>>> I guess the behavior is also undefined for file deletion. The Java
>>> implementation has two file deletion APIs: delete by name or delete by
>>> DataFile object. Either may delete all references or just one.
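>>>
>>> For illustration, the two APIs look like this (a sketch against the
>>> current Java API; the path and the dataFile variable are made up):
>>>
>>>   // Delete by path: the spec doesn't say whether this must remove
>>>   // every matching entry or just one of them.
>>>   table.newDelete()
>>>       .deleteFile("s3://bucket/db/tbl/data/file-a.parquet")
>>>       .commit();
>>>
>>>   // Delete by DataFile object: matches on the file's path under the
>>>   // hood, so the same ambiguity can apply when entries are duplicated.
>>>   table.newDelete().deleteFile(dataFile).commit();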
>>>
>>> On Thu, Dec 18, 2025 at 2:38 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> I think the spec already does this [1]:
>>>>
>>>> "Note that for any snapshot, all file paths marked with "ADDED" or
>>>> "EXISTING" may appear at most once across all manifest files in the
>>>> snapshot. If a file path appears more than once, the results of the scan
>>>> are undefined. Reader implementations may raise an error in this case, but
>>>> are not required to do so."
>>>>
>>>> But maybe we should make the language clearer?
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> [1]
>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L852
>>>>
>>>>
>>>>
>>>> On Thu, Dec 18, 2025 at 2:05 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>>
>>>>> The community has reported duplicate entries of the same data file in
>>>>> an Iceberg table. This is likely due to implementation bugs or invalid
>>>>> operations (like adding a file reference directly to the table multiple
>>>>> times). There are probably no valid reasons for having duplicate files
>>>>> in Iceberg metadata.
>>>>>
>>>>> There have been efforts to repair manifest files by de-duplicating data
>>>>> file entries or removing entries that point to non-existent files. The
>>>>> latest attempt is from Drew:
>>>>> https://lists.apache.org/thread/7ydj3fxtydymfwbcfm16toqmq64xnw1v
>>>>>
>>>>> Previously, this just resulted in duplicate rows. With row lineage in
>>>>> V3, it can get more complicated. If the data file doesn't have persisted
>>>>> row-id values, the behavior is similar to before: every row still gets a
>>>>> unique row-id even though the data file entries are duplicated. But if
>>>>> the row-id values are persisted in the data file (e.g., by a compaction
>>>>> rewrite), duplicate entries can produce two active rows with the same
>>>>> row-id, which breaks row-id uniqueness semantics.
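>>>>>
>>>>> For concreteness (made-up numbers): take a data file F with 3 rows that
>>>>> is listed twice in the same snapshot.
>>>>>
>>>>>   Case 1, no persisted row ids: each entry inherits its own
>>>>>   first_row_id, say 100 and 103, and readers derive
>>>>>   _row_id = first_row_id + _pos:
>>>>>     entry 1 -> 100, 101, 102
>>>>>     entry 2 -> 103, 104, 105
>>>>>   Duplicate rows, but the ids stay unique.
>>>>>
>>>>>   Case 2, row ids persisted in F (e.g., by a compaction rewrite): both
>>>>>   entries surface the same stored values:
>>>>>     entry 1 -> 100, 101, 102
>>>>>     entry 2 -> 100, 101, 102
>>>>>   Now two active rows share each row id.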
>>>>>
>>>>> It can also lead to surprising behavior when position deletes or
>>>>> deletion vectors (DVs) are matched against data files via data sequence
>>>>> numbers.
>>>>>
>>>>> Should the spec call out that tables with duplicate files are
>>>>> considered to be in an "incorrect" state?
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>
