Hi,

Vaibhav, thanks for the reply. I'm not sure why the attachments are not visible
on your side, so I've put the two snapshots below; they show the memory
consumption before and after the change for the same query, a full table scan:
"SELECT * FROM xxx".
You can use code similar to the following to build a partitioned table with
large metadata and many delete files (equality or position deletes). Here is a
demo snippet you can refer to (I write the test table through Trino, so methods
like `computeActual` are available):
String columns = IntStream.rangeClosed(1, 1000)
    .mapToObj(i -> "col_" + i + " VARCHAR")
    .collect(Collectors.joining(","));
String values = IntStream.rangeClosed(1, 1000)
    .mapToObj(i -> "lpad('',30,'x')")
    .collect(Collectors.joining(","));
computeActual("CREATE TABLE test(id int, %s) WITH (location = 'file:////Users/your_local_file_location')".formatted(columns));
computeActual("INSERT INTO test VALUES (1,%s),(2,%s)".formatted(values, values));
Table table = loadTable("test");
DataFile file = table.newScan().planFiles().iterator().next().file();
DeleteFile deleteFile = writeEqualityDeleteToNationTableWithDeleteColumns(
    table, Optional.empty(), Optional.empty(), Map.of("id", 1), Optional.empty());
IntStream.rangeClosed(1, 10_000).asLongStream().forEach(i -> {
  Transaction transaction = table.newTransaction();
  AppendFiles appendFiles = transaction.newFastAppend();
  appendFiles.appendFile(file);
  appendFiles.commit();
  transaction.newRowDelta()
      .addDeletes(deleteFile)
      .commit();
  transaction.commitTransaction();
});

[image: memory usage before the change]

[image: memory usage after the change]



Hi Anton, the partition info is maintained in two maps:

   1. a map from partition value to row count, and
   2. a map from ManifestFile to its corresponding partition values.

These maps are prepared while building the ManifestGroup. During this
phase, we need to read all data files, but we do *not* need to retain all
data files in memory.
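
A minimal sketch of that bookkeeping (purely illustrative; the class and field
names below, such as PartitionBookkeeping and partitionRefCounts, are not the
actual names in the PR, and it uses a simple per-partition reference count to
keep the example small). StructLikeWrapper is used as the map key because a raw
StructLike partition value does not define equals/hashCode:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.util.StructLikeWrapper;

class PartitionBookkeeping {
  // partition value -> number of data manifests still referencing it
  private final Map<StructLikeWrapper, Long> partitionRefCounts = new HashMap<>();
  // data manifest -> the partition values it contains
  private final Map<ManifestFile, List<StructLikeWrapper>> manifestPartitions = new HashMap<>();

  // called while building the ManifestGroup, after reading each data manifest
  void register(ManifestFile manifest, List<StructLikeWrapper> partitions) {
    manifestPartitions.put(manifest, partitions);
    partitions.forEach(p -> partitionRefCounts.merge(p, 1L, Long::sum));
  }
}
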
This PR primarily focuses on releasing memory held by the delete index.
Currently, all delete information is retained in memory until the planning
phase completes, which makes delete index memory usage a dominant factor in
OOM scenarios during planning. Users with a large number of delete files
will benefit the most from this change.
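
Continuing that sketch, the release path could look roughly like the method
below (again hypothetical; onManifestPlanned and deletesByPartition are
illustrative names, and DeleteFile is org.apache.iceberg.DeleteFile). The idea
is simply that once no data manifest still references a partition, its delete
files cannot match anything and can be dropped from the index:

  // hypothetical hook, invoked once a data manifest finishes planning
  void onManifestPlanned(ManifestFile manifest,
                         Map<StructLikeWrapper, List<DeleteFile>> deletesByPartition) {
    List<StructLikeWrapper> partitions = manifestPartitions.remove(manifest);
    if (partitions == null) {
      return;
    }
    for (StructLikeWrapper partition : partitions) {
      long remaining = partitionRefCounts.merge(partition, -1L, Long::sum);
      if (remaining <= 0) {
        // no remaining data manifest references this partition, so its delete
        // files can no longer apply; release them from the delete index
        partitionRefCounts.remove(partition);
        deletesByPartition.remove(partition);
      }
    }
  }
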

Thanks


On Fri, Jan 16, 2026 at 10:40 Anton Okolnychyi <[email protected]> wrote:

> Jian, if you have to compute partition counts for data files, it means you
> have to read/keep all data files in memory. This breaks the ability to
> stream through data files during planning.
>
> On Wed, Jan 14, 2026 at 04:54 Vaibhav Kumar <[email protected]> wrote:
>
>> Hi Jian,
>>
>> Thank you for your recent contribution. To help better understand the
>> proposed solution, would it be possible to include some additional
>> test cases? It would also be very helpful if you could provide a brief
>> write-up with an example illustrating how scan planning worked
>> previously and how it behaves with the new approach. Also I think the
>> memory usage snapshot is missing.
>>
>> This context will make it easier for others to review and provide
>> feedback. Thanks so much for your efforts!
>>
>> Best regards,
>> Vaibhav
>>
>>
>> On Wed, Jan 14, 2026 at 10:40 AM Jian Chen <[email protected]>
>> wrote:
>> >
>> > Hi Anton,
>> >
>> > Thanks for the quick reply.
>> >
>> > - How do you plan to build the partition counts? Will it require
>> opening data manifests as the manifest list only contains partition bounds?
>> > We need to open data manifests to get the partition info when building
>> the ManifestGroup. I don't see a good/easy way to list all partitions in
>> Iceberg today; do you know of one?
>> >
>> > - Today we stream through data files and emit scan tasks. Will you be
>> able to preserve this behavior?
>> > Yes, the behavior stays the same; we just add an additional close action
>> after planning finishes for each data manifest.
>> >
>> > - Do you have any JMH benchmarks to validate your idea?
>> > I don't have JMH benchmarks for the performance, but I did profile the
>> memory allocation:
>> > Tested using 3 partitions with large amounts of data and 10k+ delete
>> files, with Trino 477 + Iceberg 1.10; the memory usage snapshot is
>> attached below.
>> >
>> >
>> > On Wed, Jan 14, 2026 at 12:26 Anton Okolnychyi <[email protected]> wrote:
>> >>
>> >> - How do you plan to build the partition counts? Will it require
>> opening data manifests as the manifest list only contains partition bounds?
>> >> - Today we stream through data files and emit scan tasks. Will you be
>> able to preserve this behavior?
>> >> - Do you have any JMH benchmarks to validate your idea?
>> >>
>> >> - Anton
>> >>
>> >>
>> >> On Tue, Jan 13, 2026 at 8:08 PM Jian Chen <[email protected]>
>> wrote:
>> >>>
>> >>> Dear community,
>> >>>
>> >>> I would like to start a discussion around a potential improvement to
>> planning-time memory usage for large tables with a high volume of delete
>> files.
>> >>>
>> >>> When planning queries on large tables, especially delete-heavy
>> tables, the planner currently keeps all delete file metadata in memory for
>> the entire planning phase. For tables with many partitions and a large
>> number of delete files, this can significantly increase memory pressure
>> and, in extreme cases, lead to OOM issues during planning.
>> >>>
>> >>> Proposal
>> >>>
>> >>> The core idea is to allow delete file metadata to be released
>> incrementally during planning, instead of being retained until the end.
>> >>>
>> >>> I've opened a PR that shows what this looks like:
>> https://github.com/apache/iceberg/pull/14558
>> >>>
>> >>> Concretely, the proposal is to make ManifestGroup closeable so it can
>> proactively release memory once it is no longer needed. The release logic
>> is based on partition reference counting:
>> >>>
>> >>> At the beginning of planning, we track the reference count of
>> partitions across all data manifests.
>> >>>
>> >>> As each data manifest finishes planning, the reference count for its
>> associated partitions is decremented.
>> >>>
>> >>> Once a partition is no longer referenced by any remaining data files,
>> its related delete files are no longer needed for planning.
>> >>>
>> >>> At that point, we use the partition value to remove and release the
>> corresponding entries from DeleteFileIndex.
>> >>>
>> >>> Discussion
>> >>>
>> >>> I would appreciate feedback on:
>> >>>
>> >>> Whether this approach aligns with Iceberg’s planning model and
>> lifecycle expectations?
>> >>>
>> >>> Any edge cases or correctness concerns you foresee?
>>
>
