What kind of stats do we produce for position delete files beyond the file path and row positions? Are we dealing with a writer that persists the entire row in the position delete file? So far, we have modified the writer in Iceberg core to discard all bounds if a position delete file references more than one data file (see <https://github.com/apache/iceberg/blob/b3adeb12e21c56d742c408f67c3cdb96b3e02ff0/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java#L121>).

On Wed, Jun 4, 2025 at 14:05, Ryan Blue <rdb...@gmail.com> wrote:

> I think we can discard column stats for position deletes, as long as the
> data file path is preserved (as it is in #13161). For equality deletes, we
> need to preserve the stats for any equality ID columns. That reduces false
> positives by ensuring that the IDs being deleted might be in the data file
> the equality deletes are applied to.
>
> We should also take a look at how these files are written and possibly
> prevent the stats from being written at all. I think Anton updated position
> deletes to discard most column stats already.
>
> On Wed, Jun 4, 2025 at 10:09 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> It seems like a reasonable approach for DeleteFileIndex. I saw that equality
>> delete file matching uses column stats. But it seems that column stats
>> (like lower/upper bounds) aren't used for associating position delete files
>> with a data file. Plus, with file-scoped position delete files (V2),
>> matching won't need column stats either. With deletion vectors (DVs) in V3,
>> there won't be column stats written for position deletes.
>>
>> On Tue, Jun 3, 2025 at 10:01 PM Yuya Ebihara <
>> yuya.ebih...@starburstdata.com> wrote:
>>
>>> Hi,
>>>
>>> I've been investigating an OOM issue during planning in the Trino
>>> coordinator, and I've found that the main cause is the column stats
>>> handling in the DeleteFileIndex class - it loads all delete files into
>>> memory.
>>> While rewriting delete files is one option, I'd like to explore reducing
>>> memory usage within the Iceberg library itself.
>>>
>>> I've opened a PR (#13161 <https://github.com/apache/iceberg/pull/13161>)
>>> that reduces memory usage on the Trino coordinator from 12.8 GB to 2.5 GB
>>> in my benchmark.
>>> The change copies only the file_path stats in DeleteFileIndex when the
>>> file is a positional delete.
>>>
>>> I'd appreciate your feedback on whether this is an acceptable approach,
>>> or if you have other suggestions.
>>> I understand that v4 will improve stats handling as part of #13153
>>> <https://github.com/apache/iceberg/issues/13153>, but in the Trino
>>> community, we're also interested in reducing memory usage for tables using
>>> formats earlier than v4.
>>>
>>> Thanks,
>>> Yuya
>>>
>>
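
To make the stats-trimming idea discussed in this thread concrete, here is a rough sketch in plain Python. It is not the code from #13161 and it uses no Iceberg API: `DeleteFileStats`, `trim_stats`, and the `"position"`/`"equality"` labels are hypothetical names invented for illustration. The only value taken from the table spec is the reserved field ID of the `file_path` column in position delete files (2147483546). The sketch keeps only the `file_path` bounds for position deletes and only the equality ID column bounds for equality deletes.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet

# Reserved field ID of the file_path column in position delete files,
# as listed in the Iceberg table spec.
DELETE_FILE_PATH_ID = 2147483546


@dataclass(frozen=True)
class DeleteFileStats:
    """Hypothetical stand-in for a delete file's metadata (not an Iceberg class)."""
    content: str                      # "position" or "equality" (illustrative labels)
    lower_bounds: Dict[int, bytes]    # field ID -> serialized lower bound
    upper_bounds: Dict[int, bytes]    # field ID -> serialized upper bound
    equality_ids: FrozenSet[int] = frozenset()


def _keep_only(bounds: Dict[int, bytes], ids) -> Dict[int, bytes]:
    # Copy just the entries whose field ID is in `ids`.
    return {fid: value for fid, value in bounds.items() if fid in ids}


def trim_stats(stats: DeleteFileStats) -> DeleteFileStats:
    """Return a copy that holds only the bounds needed for delete file matching."""
    if stats.content == "position":
        # Position deletes are matched to data files by path only.
        keep = {DELETE_FILE_PATH_ID}
    else:
        # Equality deletes still need bounds for their equality ID columns.
        keep = set(stats.equality_ids)
    return replace(
        stats,
        lower_bounds=_keep_only(stats.lower_bounds, keep),
        upper_bounds=_keep_only(stats.upper_bounds, keep),
    )
```

With this shape, a position delete that carries bounds for the full row would retain a single entry per bounds map after trimming, which is where the memory saving would come from if the index holds many such files.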