It seems like a reasonable approach for DeleteFileIndex. I saw that equality delete file matching uses column stats, but it seems that column stats (like lower/upper bounds) aren't used for associating position delete files with a data file. With file-scoped position delete files (V2), matching won't need column stats either. And with deletion vectors (DVs) in V3, no column stats will be written for position deletes.
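
To make sure I read the idea correctly, here is a minimal sketch of the pruning as I understand it from the description in the quoted message. It is not the code in the PR: the class and method names are hypothetical, the bounds are modeled as a plain map from field ID to serialized value, and the only Iceberg-specific detail is the reserved field ID of the file_path column in position delete files (2147483546, per the table spec).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not an Iceberg class.
class PositionDeleteStatsSketch {
    // Reserved field ID of file_path in position delete files (Integer.MAX_VALUE - 101).
    static final int DELETE_FILE_PATH_ID = 2147483546;

    // Returns a new map holding only the file_path entry, if one is present.
    // Bounds for every other column are dropped so they are not retained in memory.
    static Map<Integer, ByteBuffer> keepOnlyFilePath(Map<Integer, ByteBuffer> bounds) {
        Map<Integer, ByteBuffer> kept = new HashMap<>();
        if (bounds == null) {
            return kept;
        }
        ByteBuffer value = bounds.get(DELETE_FILE_PATH_ID);
        if (value != null) {
            // duplicate() shares the bytes but gives the copy its own position and limit.
            kept.put(DELETE_FILE_PATH_ID, value.duplicate());
        }
        return kept;
    }

    // Small helper for building example bounds from a string.
    static ByteBuffer utf8(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }
}
```

Applied to both the lower and upper bounds of a position delete file, this keeps at most one entry per map regardless of how many columns the file carries stats for, which is where I would expect the memory saving to come from.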

On Tue, Jun 3, 2025 at 10:01 PM Yuya Ebihara <yuya.ebih...@starburstdata.com> wrote:

> Hi,
>
> I've been investigating an OOM issue during planning in the Trino
> coordinator, and I've found that the main cause is the column stats
> handling in the DeleteFileIndex class - it loads all delete files into
> memory.
> While rewriting delete files is one option, I'd like to explore reducing
> memory usage within the Iceberg library itself.
>
> I've opened a PR (#13161 <https://github.com/apache/iceberg/pull/13161>)
> that reduces memory usage on the Trino coordinator from 12.8 GB to 2.5 GB
> in my benchmark.
> The change copies only the file_path stats in DeleteFileIndex when the
> file is a positional delete.
>
> I'd appreciate your feedback on whether this is an acceptable approach, or
> if you have other suggestions.
> I understand that v4 will improve stats handling as part of #13153
> <https://github.com/apache/iceberg/issues/13153>, but in the Trino
> community, we're also interested in reducing memory usage for tables using
> formats earlier than v4.
>
> Thanks,
> Yuya
>