What kind of stats do we produce for position delete files beyond the file
path and row positions? Are we dealing with a writer that persists the
entire row in the position delete file? So far, we have modified the writer
in Iceberg core to discard all bounds if a position delete file references
more than one data file (see
<https://github.com/apache/iceberg/blob/b3adeb12e21c56d742c408f67c3cdb96b3e02ff0/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java#L121>).
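
To make the rule concrete, here is a minimal sketch of the idea, not the
actual PositionDeleteWriter code. The class name, the string-valued bounds
map, and the "file_path" / "pos" keys are all hypothetical and chosen only
for illustration. It keeps bounds when every delete targets the same data
file and drops them otherwise.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: not Iceberg's API, just the "discard bounds when more
// than one data file is referenced" rule in isolation.
public class PositionDeleteBoundsSketch {
  // Assumed column names, used here only as map keys.
  static final String FILE_PATH = "file_path";
  static final String POS = "pos";

  // Distinct data files referenced by the deletes written so far.
  private final TreeSet<String> referencedDataFiles = new TreeSet<>();
  private long minPos = Long.MAX_VALUE;
  private long maxPos = Long.MIN_VALUE;

  // Record one (data file path, row position) delete.
  public void delete(String dataFilePath, long pos) {
    referencedDataFiles.add(dataFilePath);
    minPos = Math.min(minPos, pos);
    maxPos = Math.max(maxPos, pos);
  }

  public Map<String, String> lowerBounds() {
    return bounds(true);
  }

  public Map<String, String> upperBounds() {
    return bounds(false);
  }

  private Map<String, String> bounds(boolean lower) {
    // With zero or several referenced data files, report no bounds at all.
    if (referencedDataFiles.size() != 1) {
      return Collections.emptyMap();
    }
    // Exactly one data file: the path bound is exact, so keep it and the
    // position range.
    Map<String, String> result = new HashMap<>();
    result.put(FILE_PATH, referencedDataFiles.first());
    result.put(POS, Long.toString(lower ? minPos : maxPos));
    return result;
  }
}
```

In this sketch, a file-scoped delete file keeps an exact file_path bound
that a reader could match against a data file, while a delete file spanning
several data files carries no bounds to hold in memory.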

On Wed, Jun 4, 2025 at 14:05, Ryan Blue <rdb...@gmail.com> wrote:

> I think we can discard column stats for position deletes, as long as the
> data file path is preserved (as it is in #13161). For equality deletes, we
> need to preserve the stats for any equality ID columns. That reduces false
> positives by ensuring that the IDs being deleted might be in the data file
> the equality deletes are applied to.
>
> We should also take a look at how these files are written and possibly
> prevent the stats from being written at all. I think Anton updated position
> deletes to discard most column stats already.
>
> On Wed, Jun 4, 2025 at 10:09 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> It seems like a reasonable approach for DeleteFileIndex. I saw that equality
>> delete file matching uses column stats. But it seems that column stats
>> (like lower/upper bounds) aren't used for associating position delete files
>> with a data file. Plus, with file-scoped position delete files (V2),
>> matching won't need column stats either. With deletion vectors (DVs) in V3,
>> there won't be column stats written for position deletes.
>>
>> On Tue, Jun 3, 2025 at 10:01 PM Yuya Ebihara <
>> yuya.ebih...@starburstdata.com> wrote:
>>
>>> Hi,
>>>
>>> I've been investigating an OOM issue during planning in the Trino
>>> coordinator, and I've found that the main cause is the column stats
>>> handling in the DeleteFileIndex class - it loads all delete files into
>>> memory.
>>> While rewriting delete files is one option, I'd like to explore reducing
>>> memory usage within the Iceberg library itself.
>>>
>>> I've opened a PR (#13161 <https://github.com/apache/iceberg/pull/13161>)
>>> that reduces memory usage on the Trino coordinator from 12.8 GB to 2.5 GB
>>> in my benchmark.
>>> The change copies only the file_path stats in DeleteFileIndex when the
>>> file is a positional delete.
>>>
>>> I'd appreciate your feedback on whether this is an acceptable approach,
>>> or if you have other suggestions.
>>> I understand that v4 will improve stats handling as part of #13153
>>> <https://github.com/apache/iceberg/issues/13153>, but in the Trino
>>> community, we're also interested in reducing memory usage for tables using
>>> formats earlier than v4.
>>>
>>> Thanks,
>>> Yuya
>>>
>>
