Jian, if you have to compute partition counts for data files, it means you
have to read and keep all data file entries in memory. That breaks the
ability to stream through data files during planning.
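
To illustrate the concern with a rough, hypothetical sketch (the types and
helpers below are placeholders, not actual Iceberg planner code): today
planning is a single streaming pass over data manifest entries, while an
up-front partition count forces either buffering every entry or a second
read of the data manifests before any scan task can be emitted.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class PlanningSketch {
  interface PartitionKey {}
  interface DataEntry { PartitionKey partition(); }

  // Today: entries are consumed and discarded as scan tasks are emitted,
  // so memory stays roughly constant in the number of data files.
  static void streamPlan(Iterator<DataEntry> entries) {
    while (entries.hasNext()) {
      emitScanTask(entries.next());
    }
  }

  // With up-front partition counts: this pass must finish before planning
  // starts, which means buffering entries or reading the manifests twice.
  static Map<PartitionKey, Integer> countPartitions(Iterator<DataEntry> entries) {
    Map<PartitionKey, Integer> counts = new HashMap<>();
    while (entries.hasNext()) {
      counts.merge(entries.next().partition(), 1, Integer::sum);
    }
    return counts;
  }

  static void emitScanTask(DataEntry entry) { /* plan one data file */ }
}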

On Wed, Jan 14, 2026 at 04:54 Vaibhav Kumar <[email protected]> wrote:

> Hi Jian,
>
> Thank you for your recent contribution. To help better understand the
> proposed solution, would it be possible to include some additional
> test cases? It would also be very helpful if you could provide a brief
> write-up with an example illustrating how scan planning worked
> previously and how it behaves with the new approach. Also I think the
> memory usage snapshot is missing.
>
> This context will make it easier for others to review and provide
> feedback. Thanks so much for your efforts!
>
> Best regards,
> Vaibhav
>
>
> On Wed, Jan 14, 2026 at 10:40 AM Jian Chen <[email protected]> wrote:
> >
> > Hi Anton,
> >
> > Thanks for the quick reply.
> >
> > - How do you plan to build the partition counts? Will it require opening
> data manifests as the manifest list only contains partition bounds?
> > We need to open the data manifests to get the partition info when
> building the ManifestGroup. I don't see a good or easy way to list all
> partitions in Iceberg today; do you know of one?
> >
> > - Today we stream through data files and emit scan tasks. Will you be
> able to preserve this behavior?
> > Yes, the streaming behavior stays the same; we just add an additional
> close action once planning finishes for each data manifest.
> >
> > - Do you have any JMH benchmarks to validate your idea?
> > I don't have JMH benchmarks for the performance, but I did profile the
> memory allocation:
> > I tested with 3 partitions containing large amounts of data and 10k+
> delete files, on Trino 477 + Iceberg 1.10; the memory usage snapshot is
> attached below.
> >
> >
> > On Wed, Jan 14, 2026 at 12:26, Anton Okolnychyi <[email protected]> wrote:
> >>
> >> - How do you plan to build the partition counts? Will it require
> opening data manifests as the manifest list only contains partition bounds?
> >> - Today we stream through data files and emit scan tasks. Will you be
> able to preserve this behavior?
> >> - Do you have any JMH benchmarks to validate your idea?
> >>
> >> - Anton
> >>
> >>
> >> On Tue, Jan 13, 2026 at 8:08 PM Jian Chen <[email protected]>
> wrote:
> >>>
> >>> Dear community,
> >>>
> >>> I would like to start a discussion around a potential improvement to
> planning-time memory usage for large tables with a high volume of delete
> files.
> >>>
> >>> When planning queries on large tables, especially delete-heavy tables,
> the planner currently keeps all delete file metadata in memory for the
> entire planning phase. For tables with many partitions and a large number
> of delete files, this can significantly increase memory pressure and, in
> extreme cases, lead to OOM issues during planning.
> >>>
> >>> Proposal
> >>>
> >>> The core idea is to allow delete file metadata to be released
> incrementally during planning, instead of being retained until the end.
> >>>
> >>> I've opened a PR that shows what this looks like:
> https://github.com/apache/iceberg/pull/14558
> >>>
> >>> Concretely, the proposal is to make ManifestGroup closeable so it can
> proactively release memory once it is no longer needed. The release logic
> is based on partition reference counting:
> >>>
> >>> At the beginning of planning, we track the reference count of
> partitions across all data manifests.
> >>>
> >>> As each data manifest finishes planning, the reference count for its
> associated partitions is decremented.
> >>>
> >>> Once a partition is no longer referenced by any remaining data files,
> its related delete files are no longer needed for planning.
> >>>
> >>> At that point, we use the partition value to remove and release the
> corresponding entries from DeleteFileIndex.
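> >>>
> >>> To make the bookkeeping concrete, here is a minimal sketch of the
> >>> reference counting, assuming a placeholder DeleteIndex type; the real
> >>> change in the PR touches ManifestGroup and DeleteFileIndex, and the
> >>> names below are illustrative only, not the actual classes or methods:
> >>>
> >>> import java.util.HashMap;
> >>> import java.util.Map;
> >>>
> >>> class PartitionRefCounter {
> >>>   // Placeholder for the structure holding delete file metadata keyed
> >>>   // by partition (DeleteFileIndex plays this role in Iceberg).
> >>>   interface DeleteIndex {
> >>>     void releasePartition(Object partitionValue);
> >>>   }
> >>>
> >>>   private final Map<Object, Integer> refCounts = new HashMap<>();
> >>>   private final DeleteIndex deleteIndex;
> >>>
> >>>   PartitionRefCounter(DeleteIndex deleteIndex) {
> >>>     this.deleteIndex = deleteIndex;
> >>>   }
> >>>
> >>>   // At the beginning of planning: count how many data manifests
> >>>   // reference each partition value.
> >>>   void retain(Object partitionValue) {
> >>>     refCounts.merge(partitionValue, 1, Integer::sum);
> >>>   }
> >>>
> >>>   // When a data manifest finishes planning: decrement its partitions;
> >>>   // once a partition is no longer referenced by any remaining data
> >>>   // manifest, drop its delete entries so the memory can be reclaimed.
> >>>   void release(Object partitionValue) {
> >>>     Integer remaining = refCounts.computeIfPresent(partitionValue, (p, c) -> c - 1);
> >>>     if (remaining != null && remaining == 0) {
> >>>       refCounts.remove(partitionValue);
> >>>       deleteIndex.releasePartition(partitionValue);
> >>>     }
> >>>   }
> >>> }
> >>>
> >>> The invariant is that a partition's delete file entries are only
> >>> released after the last data manifest referencing that partition has
> >>> been planned, which is what keeps the emitted scan tasks unchanged.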
> >>>
> >>> Discussion
> >>>
> >>> I would appreciate feedback on:
> >>>
> >>> Whether this approach aligns with Iceberg’s planning model and
> lifecycle expectations?
> >>>
> >>> Any edge cases or correctness concerns you foresee?
>
