Jian, Apache mailing lists don't allow image attachments in email. You may
want to put those in a doc and share the link to the doc.

On Sun, Jan 18, 2026 at 10:44 PM Jian Chen <[email protected]> wrote:

> Hi,
>
> Vaibhav, thanks for the reply. I am not sure why the attachments are not
> visible on your side, so I put the two snapshots below; they show the
> memory consumption before and after the change for the same query:
> a full table scan, "SELECT * FROM xxx".
> You can use code similar to the following to build a partitioned table
> with large metadata and delete files (either equality or position
> deletes). Here is a demo piece of code you can refer to
> (I am writing the test table through Trino, so methods like
> `computeActual` are available there):
> // build a wide table: 1000 VARCHAR columns, 30 chars per column per row
> String columns = IntStream.rangeClosed(1, 1000)
>     .mapToObj(i -> "col_" + i + " VARCHAR")
>     .collect(Collectors.joining(","));
> String values = IntStream.rangeClosed(1, 1000)
>     .mapToObj(i -> "lpad('',30,'x')")
>     .collect(Collectors.joining(","));
> computeActual("CREATE TABLE test(id int, %s) WITH (location = 'file:////Users/your_local_file_location')".formatted(columns));
> computeActual("INSERT INTO test VALUES (1,%s),(2,%s)".formatted(values, values));
>
> Table table = loadTable("test");
> DataFile file = table.newScan().planFiles().iterator().next().file();
> DeleteFile deleteFile = writeEqualityDeleteToNationTableWithDeleteColumns(
>     table, Optional.empty(), Optional.empty(), Map.of("id", 1), Optional.empty());
>
> // commit 10k transactions, each re-appending the same data file and delete
> // file, to inflate the metadata and the number of delete files
> IntStream.rangeClosed(1, 10_000).asLongStream().forEach(i -> {
>   Transaction transaction = table.newTransaction();
>   AppendFiles appendFiles = transaction.newFastAppend();
>   appendFiles.appendFile(file);
>   appendFiles.commit();
>   transaction.newRowDelta()
>       .addDeletes(deleteFile)
>       .commit();
>   transaction.commitTransaction();
> });
>
> [image: image.png - memory snapshot before the change]
>
> [image: image.png - memory snapshot after the change]
>
> Hi Anton, the partition info is maintained using two maps:
>
>    1. a map from partition value to its reference count, and
>    2. a map from ManifestFile to its corresponding partition values.
>
> These maps are prepared while building the ManifestGroup. During this
> phase, we need to read all data files, but we do *not* need to retain all
> data files in memory.
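>
> For illustration, the bookkeeping could look roughly like the sketch
> below (class and method names are made up for the example; this is not
> the actual PR code):
>
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
> import org.apache.iceberg.ManifestFile;
> import org.apache.iceberg.StructLike;
>
> class PartitionRefTracker {
>   // partition value -> number of data manifests still referencing it;
>   // a real implementation would wrap StructLike (e.g. StructLikeWrapper)
>   // to get value-based equals/hashCode for the map keys
>   private final Map<StructLike, Integer> refCounts = new HashMap<>();
>
>   // data manifest -> the partition values it contains
>   private final Map<ManifestFile, Set<StructLike>> manifestPartitions =
>       new HashMap<>();
>
>   // called for each data file entry while building the ManifestGroup
>   void track(ManifestFile manifest, StructLike partition) {
>     boolean firstSeenInManifest = manifestPartitions
>         .computeIfAbsent(manifest, m -> new HashSet<>())
>         .add(partition);
>     if (firstSeenInManifest) {
>       refCounts.merge(partition, 1, Integer::sum);
>     }
>   }
> }
>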
> This PR primarily focuses on releasing memory held by the delete index.
> Currently, during planning, all delete information is retained in memory
> until the planning phase completes, making delete index memory usage a
> dominant factor in OOM scenarios. Users with a large number of delete files
> will benefit the most from this change.
>
> Thanks
>
>
> Anton Okolnychyi <[email protected]> 于2026年1月16日周五 10:40写道:
>
>> Jian, if you have to compute partition counts for data files, it means
>> you have to read/keep all data files in memory. This breaks the ability to
>> stream through data files during planning.
>>
>> On Wed, Jan 14, 2026 at 04:54, Vaibhav Kumar <[email protected]> wrote:
>>
>>> Hi Jian,
>>>
>>> Thank you for your recent contribution. To help better understand the
>>> proposed solution, would it be possible to include some additional
>>> test cases? It would also be very helpful if you could provide a brief
>>> write-up with an example illustrating how scan planning worked
>>> previously and how it behaves with the new approach. Also I think the
>>> memory usage snapshot is missing.
>>>
>>> This context will make it easier for others to review and provide
>>> feedback. Thanks so much for your efforts!
>>>
>>> Best regards,
>>> Vaibhav
>>>
>>>
>>> On Wed, Jan 14, 2026 at 10:40 AM Jian Chen <[email protected]>
>>> wrote:
>>> >
>>> > Hi Anton,
>>> >
>>> > Thanks for quick reply.
>>> >
>>> > - How do you plan to build the partition counts? Will it require
>>> opening data manifests as the manifest list only contains partition bounds?
>>> > We need to open data manifests to get the partition info when
>>> building the ManifestGroup; I don't see a good/easy way to list all
>>> partitions in Iceberg today. Do you know of one?
>>> >
>>> > - Today we stream through data files and emit scan tasks. Will you be
>>> able to preserve this behavior?
>>> > Yes, it stays the same; we just add an additional close action when
>>> planning finishes for each data manifest.
>>> >
>>> > - Do you have any JMH benchmarks to validate your idea?
>>> > I don't have JMH benchmarks for the performance, but I did a memory
>>> allocation profile:
>>> > Tested with 3 partitions holding large amounts of data and 10k+ delete
>>> files, on Trino 477 + Iceberg 1.10; the memory usage snapshot is
>>> attached below.
>>> >
>>> >
>>> >> Anton Okolnychyi <[email protected]> wrote on Wed, Jan 14, 2026 at 12:26:
>>> >>
>>> >> - How do you plan to build the partition counts? Will it require
>>> opening data manifests as the manifest list only contains partition bounds?
>>> >> - Today we stream through data files and emit scan tasks. Will you be
>>> able to preserve this behavior?
>>> >> - Do you have any JMH benchmarks to validate your idea?
>>> >>
>>> >> - Anton
>>> >>
>>> >>
>>> >> On Tue, Jan 13, 2026 at 8:08 PM Jian Chen <[email protected]>
>>> wrote:
>>> >>>
>>> >>> Dear community,
>>> >>>
>>> >>> I would like to start a discussion around a potential improvement to
>>> planning-time memory usage for large tables with a high volume of delete
>>> files.
>>> >>>
>>> >>> When planning queries on large tables, especially delete-heavy
>>> tables, the planner currently keeps all delete file metadata in memory for
>>> the entire planning phase. For tables with many partitions and a large
>>> number of delete files, this can significantly increase memory pressure
>>> and, in extreme cases, lead to OOM issues during planning.
>>> >>>
>>> >>> Proposal
>>> >>>
>>> >>> The core idea is to allow delete file metadata to be released
>>> incrementally during planning, instead of being retained until the end.
>>> >>>
>>> >>> I've sent a PR that shows how this looks:
>>> https://github.com/apache/iceberg/pull/14558
>>> >>>
>>> >>> Concretely, the proposal is to make ManifestGroup closeable so it
>>> can proactively release memory once it is no longer needed. The release
>>> logic is based on partition reference counting:
>>> >>>
>>> >>> At the beginning of planning, we track the reference count of
>>> partitions across all data manifests.
>>> >>>
>>> >>> As each data manifest finishes planning, the reference count for its
>>> associated partitions is decremented.
>>> >>>
>>> >>> Once a partition is no longer referenced by any remaining data
>>> files, its related delete files are no longer needed for planning.
>>> >>>
>>> >>> At that point, we use the partition value to remove and release the
>>> corresponding entries from DeleteFileIndex.
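>>> >>>
>>> >>> A rough sketch of this release step, with illustrative names (this
>>> is not the actual PR code):
>>> >>>
>>> >>> // bookkeeping assumed to be built at the start of planning:
>>> >>> Map<StructLike, Integer> refCounts; // partition -> #manifests referencing it
>>> >>> Map<ManifestFile, Set<StructLike>> manifestPartitions; // manifest -> partitions
>>> >>> DeleteFileIndex deleteFileIndex;
>>> >>>
>>> >>> void onManifestPlanned(ManifestFile manifest) {
>>> >>>   Set<StructLike> partitions = manifestPartitions.remove(manifest);
>>> >>>   if (partitions == null) {
>>> >>>     return;
>>> >>>   }
>>> >>>   for (StructLike partition : partitions) {
>>> >>>     if (refCounts.merge(partition, -1, Integer::sum) <= 0) {
>>> >>>       refCounts.remove(partition);
>>> >>>       // no remaining data manifest references this partition, so its
>>> >>>       // delete entries can be dropped from the index
>>> >>>       deleteFileIndex.releasePartition(partition); // hypothetical hook
>>> >>>     }
>>> >>>   }
>>> >>> }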
>>> >>>
>>> >>> Discussion
>>> >>>
>>> >>> I would appreciate feedback on:
>>> >>>
>>> >>> Whether this approach aligns with Iceberg’s planning model and
>>> lifecycle expectations.
>>> >>>
>>> >>> Any edge cases or correctness concerns you foresee.
>>>
>>
