Dear community, I would like to start a discussion around a potential improvement to planning-time memory usage for large tables with a high volume of delete files.
When planning queries on large tables, especially delete-heavy ones, the planner currently keeps all delete file metadata in memory for the entire planning phase. For tables with many partitions and a large number of delete files, this significantly increases memory pressure and, in extreme cases, can lead to OOM failures during planning.

*Proposal*

The core idea is to release delete file metadata incrementally during planning instead of retaining it until the end. I've opened a PR that shows what this looks like: https://github.com/apache/iceberg/pull/14558

Concretely, the proposal is to make ManifestGroup closeable so it can proactively release memory once it is no longer needed. The release logic is based on *partition reference counting* (a rough sketch is included at the end of this message):
- At the beginning of planning, we track the reference count of each partition across all data manifests.
- As each data manifest finishes planning, the reference counts for its associated partitions are decremented.
- Once a partition is no longer referenced by any remaining data manifest, its related delete files are no longer needed for planning.
- At that point, we use the partition value to remove and release the corresponding entries from DeleteFileIndex.

*Discussion*

I would appreciate feedback on:
- Does this approach align with Iceberg's planning model and lifecycle expectations?
- Are there edge cases or correctness concerns you foresee?
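For illustration, below is a minimal, self-contained sketch of the reference-counting piece only. The names (PlanningRefCountSketch, PartitionRefCounter, retain, release) are hypothetical and not taken from the PR; it assumes retain() and release() calls are balanced per (data manifest, partition) pair, and it leaves out how the release is wired into ManifestGroup's close path and DeleteFileIndex.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PlanningRefCountSketch {

  // Hypothetical helper: counts how many data manifests still reference each
  // partition, so delete metadata can be released as soon as the count hits zero.
  static class PartitionRefCounter<P> {
    private final Map<P, Integer> refCounts = new ConcurrentHashMap<>();

    // Called while scanning data manifests at the start of planning,
    // once per (data manifest, partition) pair.
    void retain(P partition) {
      refCounts.merge(partition, 1, Integer::sum);
    }

    // Called as each data manifest finishes planning. Returns true when no
    // remaining manifest references the partition, i.e. the caller may now
    // drop the partition's entries from DeleteFileIndex. Assumes retain()
    // and release() calls are balanced.
    boolean release(P partition) {
      Integer remaining =
          refCounts.computeIfPresent(partition, (p, count) -> count == 1 ? null : count - 1);
      return remaining == null;
    }
  }

  public static void main(String[] args) {
    PartitionRefCounter<String> counter = new PartitionRefCounter<>();
    // Two data manifests reference partition "p=1", one references "p=2".
    counter.retain("p=1");
    counter.retain("p=1");
    counter.retain("p=2");

    System.out.println(counter.release("p=1")); // false: another manifest still needs p=1
    System.out.println(counter.release("p=1")); // true: p=1 delete metadata can be freed
    System.out.println(counter.release("p=2")); // true: p=2 delete metadata can be freed
  }
}

In the actual proposal the release trigger would presumably be tied to ManifestGroup's lifecycle rather than invoked manually as above; the PR linked earlier shows the concrete wiring.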
