Jian, if you have to compute partition counts for data files, it means you have to read/keep the metadata for all data files in memory. This breaks the ability to stream through data files during planning.
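To make that concern concrete, here is a rough sketch of the two planning shapes involved. The class, types, and method names below are placeholders for illustration only, not actual Iceberg planning APIs:

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Placeholder types for manifest entries, partition values and scan tasks;
// this is not the real Iceberg planning code.
class PlanningShapes<ENTRY, PARTITION, TASK> {

  // Today: entries are streamed and tasks emitted one at a time, so only
  // the current entry needs to stay resident during planning.
  void streamingPlan(
      Iterable<ENTRY> entries, Function<ENTRY, TASK> toTask, Consumer<TASK> emit) {
    for (ENTRY entry : entries) {
      emit.accept(toTask.apply(entry)); // entry can be dropped right after
    }
  }

  // To build per-partition counts, every entry must be visited up front,
  // which means an extra pass over the data manifests (or materializing
  // their entries) before the counts are usable.
  Map<PARTITION, Integer> countPartitionRefs(
      Iterable<ENTRY> entries, Function<ENTRY, PARTITION> partitionOf) {
    Map<PARTITION, Integer> counts = new HashMap<>();
    for (ENTRY entry : entries) {
      counts.merge(partitionOf.apply(entry), 1, Integer::sum);
    }
    return counts;
  }
}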
On Wed, Jan 14, 2026 at 04:54, Vaibhav Kumar <[email protected]> wrote:
> Hi Jian,
>
> Thank you for your recent contribution. To help better understand the
> proposed solution, would it be possible to include some additional
> test cases? It would also be very helpful if you could provide a brief
> write-up with an example illustrating how scan planning worked
> previously and how it behaves with the new approach. Also, I think the
> memory usage snapshot is missing.
>
> This context will make it easier for others to review and provide
> feedback. Thanks so much for your efforts!
>
> Best regards,
> Vaibhav
>
>
> On Wed, Jan 14, 2026 at 10:40 AM Jian Chen <[email protected]> wrote:
> >
> > Hi Anton,
> >
> > Thanks for the quick reply.
> >
> > - How do you plan to build the partition counts? Will it require opening
> > data manifests as the manifest list only contains partition bounds?
> > We need to open the data manifests to get the partition info when
> > building the ManifestGroup. I don't see a good/easy way to list all
> > partitions in Iceberg today; do you know of one?
> >
> > - Today we stream through data files and emit scan tasks. Will you be
> > able to preserve this behavior?
> > Yes, the streaming behavior stays the same; we just add an additional
> > close action once planning finishes for each data manifest.
> >
> > - Do you have any JMH benchmarks to validate your idea?
> > I don't have JMH benchmarks for the performance yet, but I did profile
> > memory allocation: tested with 3 partitions holding large amounts of
> > data and 10k+ delete files, on Trino 477 + Iceberg 1.10. The memory
> > usage snapshot is attached below.
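To illustrate the "additional close action" above, here is a hypothetical sketch; the reader interface and callback names are made up and do not match the actual Iceberg classes (ManifestGroup, ManifestReader) touched by the PR:

import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical reader and callback, standing in for the real planning code.
class PerManifestPlanning<ENTRY, TASK> {

  interface SketchManifestReader<E> extends Closeable {
    Iterable<E> entries();
  }

  // Entries are still streamed and tasks are still emitted one by one; the
  // only addition is a close/release step once a manifest is fully planned.
  void plan(
      List<SketchManifestReader<ENTRY>> dataManifests,
      Function<ENTRY, TASK> toTask,
      Consumer<TASK> emit,
      Consumer<SketchManifestReader<ENTRY>> onManifestDone) {
    for (SketchManifestReader<ENTRY> manifest : dataManifests) {
      try (SketchManifestReader<ENTRY> reader = manifest) {
        for (ENTRY entry : reader.entries()) {
          emit.accept(toTask.apply(entry));
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      onManifestDone.accept(manifest); // e.g. decrement partition reference counts here
    }
  }
}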
> >
> >
> > On Wed, Jan 14, 2026 at 12:26, Anton Okolnychyi <[email protected]> wrote:
> >>
> >> - How do you plan to build the partition counts? Will it require
> >> opening data manifests as the manifest list only contains partition bounds?
> >> - Today we stream through data files and emit scan tasks. Will you be
> >> able to preserve this behavior?
> >> - Do you have any JMH benchmarks to validate your idea?
> >>
> >> - Anton
> >>
> >>
> >> On Tue, Jan 13, 2026 at 8:08 PM Jian Chen <[email protected]> wrote:
> >>>
> >>> Dear community,
> >>>
> >>> I would like to start a discussion around a potential improvement to
> >>> planning-time memory usage for large tables with a high volume of
> >>> delete files.
> >>>
> >>> When planning queries on large tables, especially delete-heavy tables,
> >>> the planner currently keeps all delete file metadata in memory for the
> >>> entire planning phase. For tables with many partitions and a large
> >>> number of delete files, this can significantly increase memory
> >>> pressure and, in extreme cases, lead to OOM issues during planning.
> >>>
> >>> Proposal
> >>>
> >>> The core idea is to allow delete file metadata to be released
> >>> incrementally during planning, instead of being retained until the end.
> >>>
> >>> I've opened a PR that shows how this looks:
> >>> https://github.com/apache/iceberg/pull/14558
> >>>
> >>> Concretely, the proposal is to make ManifestGroup closeable so it can
> >>> proactively release memory once it is no longer needed. The release
> >>> logic is based on partition reference counting:
> >>>
> >>> At the beginning of planning, we track the reference count of
> >>> partitions across all data manifests.
> >>>
> >>> As each data manifest finishes planning, the reference count for its
> >>> associated partitions is decremented.
> >>>
> >>> Once a partition is no longer referenced by any remaining data files,
> >>> its related delete files are no longer needed for planning.
> >>>
> >>> At that point, we use the partition value to remove and release the
> >>> corresponding entries from DeleteFileIndex.
> >>>
> >>> Discussion
> >>>
> >>> I would appreciate feedback on:
> >>>
> >>> Whether this approach aligns with Iceberg's planning model and
> >>> lifecycle expectations.
> >>>
> >>> Any edge cases or correctness concerns you foresee.
>
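For reference, a minimal sketch of the proposed reference-counting release as described above. Partition values are simplified to String keys and the delete-file index to a plain Map; the real DeleteFileIndex and partition types in Iceberg look different, and this is not the actual PR code:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative bookkeeping only; in the PR this logic would live alongside
// ManifestGroup/DeleteFileIndex rather than in a standalone class.
class PartitionRefCountingRelease {

  private final Map<String, Integer> refCounts = new HashMap<>();
  private final Map<String, List<String>> deleteFilesByPartition;

  PartitionRefCountingRelease(Map<String, List<String>> deleteFilesByPartition) {
    this.deleteFilesByPartition = deleteFilesByPartition;
  }

  // Step 1: at the start of planning, count how many data manifests (or data
  // files) reference each partition.
  void reference(String partition) {
    refCounts.merge(partition, 1, Integer::sum);
  }

  // Steps 2-3: when a data manifest finishes planning, decrement the counts
  // for its partitions; once a partition reaches zero, no remaining data file
  // can need its deletes, so the corresponding index entries are released.
  void manifestFinished(Iterable<String> partitionsInManifest) {
    for (String partition : partitionsInManifest) {
      int remaining = refCounts.merge(partition, -1, Integer::sum);
      if (remaining <= 0) {
        refCounts.remove(partition);
        deleteFilesByPartition.remove(partition); // release delete metadata early
      }
    }
  }
}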
