Jian, Apache mailing lists don't allow image attachments in email. You may want to put those in a doc and share the link to the doc.
On Sun, Jan 18, 2026 at 10:44 PM Jian Chen <[email protected]> wrote:
> Hi,
>
> Vaibhav, thanks for the reply. Not sure why the attachments are not
> visible on your side; I put the two snapshots below. They show the memory
> consumption before and after the change for the same query, a full table
> scan: "SELECT * FROM xxx".
> You could use similar code to build a partitioned table with large
> metadata and delete files (equality or position deletes); here is a demo
> piece of code you can refer to (I am writing the test table through
> Trino, so methods like `computeActual` are available):
>
> String columns = IntStream.rangeClosed(1, 1000)
>     .mapToObj(i -> "col_" + i + " VARCHAR")
>     .collect(Collectors.joining(","));
> String values = IntStream.rangeClosed(1, 1000)
>     .mapToObj(i -> "lpad('',30,'x')")
>     .collect(Collectors.joining(","));
> computeActual("CREATE TABLE test(id int, %s) WITH (location = 'file:////Users/your_local_file_location')".formatted(columns));
> computeActual("INSERT INTO test VALUES (1,%s),(2,%s)".formatted(values, values));
> Table table = loadTable("test");
> DataFile file = table.newScan().planFiles().iterator().next().file();
> DeleteFile deleteFile = writeEqualityDeleteToNationTableWithDeleteColumns(
>     table, Optional.empty(), Optional.empty(), Map.of("id", 1), Optional.empty());
> IntStream.rangeClosed(1, 10_000).asLongStream().forEach(i -> {
>     Transaction transaction = table.newTransaction();
>     AppendFiles appendFiles = transaction.newFastAppend();
>     appendFiles.appendFile(file);
>     appendFiles.commit();
>     transaction.newRowDelta()
>         .addDeletes(deleteFile)
>         .commit();
>     transaction.commitTransaction();
> });
>
> [image: image.png]
>
> [image: image.png]
>
> Hi Anton, the partition info is maintained using two maps:
>
> 1. a map from partition value to row count, and
> 2. a map from ManifestFile to its corresponding partition values.
>
> These maps are prepared while building the ManifestGroup. During this
> phase, we need to read all data files, but we do *not* need to retain all
> data files in memory.
> This PR primarily focuses on releasing memory held by the delete index.
> Currently, during planning, all delete information is retained in memory
> until the planning phase completes, making delete index memory usage a
> dominant factor in OOM scenarios. Users with a large number of delete
> files will benefit the most from this change.
>
> Thanks
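As a rough illustration for readers following the thread, the two bookkeeping
maps described above might look something like the sketch below. All names
here are hypothetical, not the actual fields in the PR; plain String keys
stand in for partition values, since StructLike does not define
equals/hashCode and real code would need something like Iceberg's
StructLikeMap.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical bookkeeping built while constructing the ManifestGroup.
    class PartitionInfoSketch {
      // 1. partition value -> total row count across data files in that partition
      private final Map<String, Long> rowCountsByPartition = new HashMap<>();

      // 2. data manifest (identified here by path) -> partition values it contains
      private final Map<String, Set<String>> partitionsByManifest = new HashMap<>();

      // record one data file entry as it streams by; the file itself is not retained
      void record(String manifestPath, String partition, long rowCount) {
        rowCountsByPartition.merge(partition, rowCount, Long::sum);
        partitionsByManifest
            .computeIfAbsent(manifestPath, p -> new HashSet<>())
            .add(partition);
      }
    }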
> Anton Okolnychyi <[email protected]> wrote on Fri, Jan 16, 2026 at 10:40:
>
>> Jian, if you have to compute partition counts for data files, it means
>> you have to read/keep all data files in memory. This breaks the ability
>> to stream through data files during planning.
>>
>> Vaibhav Kumar <[email protected]> wrote on Wed, Jan 14, 2026 at 04:54:
>>
>>> Hi Jian,
>>>
>>> Thank you for your recent contribution. To help better understand the
>>> proposed solution, would it be possible to include some additional
>>> test cases? It would also be very helpful if you could provide a brief
>>> write-up with an example illustrating how scan planning worked
>>> previously and how it behaves with the new approach. Also, I think the
>>> memory usage snapshot is missing.
>>>
>>> This context will make it easier for others to review and provide
>>> feedback. Thanks so much for your efforts!
>>>
>>> Best regards,
>>> Vaibhav
>>>
>>> On Wed, Jan 14, 2026 at 10:40 AM Jian Chen <[email protected]> wrote:
>>> >
>>> > Hi Anton,
>>> >
>>> > Thanks for the quick reply.
>>> >
>>> > - How do you plan to build the partition counts? Will it require
>>> > opening data manifests, as the manifest list only contains partition
>>> > bounds?
>>> > We need to open data manifests to get the partition info when
>>> > building the ManifestGroup. I don't see a good/easy way to list all
>>> > partitions in Iceberg today; do you know of one?
>>> >
>>> > - Today we stream through data files and emit scan tasks. Will you
>>> > be able to preserve this behavior?
>>> > Yes, the behavior stays the same; we just add an additional close
>>> > action once planning finishes for each data manifest.
>>> >
>>> > - Do you have any JMH benchmarks to validate your idea?
>>> > I don't have JMH benchmarks for the performance, but I did profile
>>> > memory allocation: tested using 3 partitions with large amounts of
>>> > data and 10k+ delete files, on Trino 477 + Iceberg 1.10; the memory
>>> > usage snapshot is attached below.
>>> >
>>> > Anton Okolnychyi <[email protected]> wrote on Wed, Jan 14, 2026 at 12:26:
>>> >>
>>> >> - How do you plan to build the partition counts? Will it require
>>> >> opening data manifests as the manifest list only contains partition
>>> >> bounds?
>>> >> - Today we stream through data files and emit scan tasks. Will you
>>> >> be able to preserve this behavior?
>>> >> - Do you have any JMH benchmarks to validate your idea?
>>> >>
>>> >> - Anton
>>> >>
>>> >> On Tue, Jan 13, 2026 at 8:08 PM Jian Chen <[email protected]> wrote:
>>> >>>
>>> >>> Dear community,
>>> >>>
>>> >>> I would like to start a discussion around a potential improvement
>>> >>> to planning-time memory usage for large tables with a high volume
>>> >>> of delete files.
>>> >>>
>>> >>> When planning queries on large tables, especially delete-heavy
>>> >>> tables, the planner currently keeps all delete file metadata in
>>> >>> memory for the entire planning phase. For tables with many
>>> >>> partitions and a large number of delete files, this can
>>> >>> significantly increase memory pressure and, in extreme cases, lead
>>> >>> to OOM issues during planning.
>>> >>>
>>> >>> Proposal
>>> >>>
>>> >>> The core idea is to allow delete file metadata to be released
>>> >>> incrementally during planning, instead of being retained until the
>>> >>> end.
>>> >>>
>>> >>> I've sent a PR that shows how it looks:
>>> >>> https://github.com/apache/iceberg/pull/14558
>>> >>>
>>> >>> Concretely, the proposal is to make ManifestGroup closeable so it
>>> >>> can proactively release memory once it is no longer needed. The
>>> >>> release logic is based on partition reference counting:
>>> >>>
>>> >>> At the beginning of planning, we track the reference count of
>>> >>> partitions across all data manifests.
>>> >>>
>>> >>> As each data manifest finishes planning, the reference count for
>>> >>> its associated partitions is decremented.
>>> >>>
>>> >>> Once a partition is no longer referenced by any remaining data
>>> >>> files, its related delete files are no longer needed for planning.
>>> >>>
>>> >>> At that point, we use the partition value to remove and release
>>> >>> the corresponding entries from DeleteFileIndex.
>>> >>>
>>> >>> Discussion
>>> >>>
>>> >>> I would appreciate feedback on:
>>> >>>
>>> >>> - Whether this approach aligns with Iceberg's planning model and
>>> >>> lifecycle expectations.
>>> >>>
>>> >>> - Any edge cases or correctness concerns you foresee.
>>
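As a rough illustration of the partition reference counting described in the
proposal above: the class and method names here (PartitionRefTracker, retain,
release) are hypothetical, and String keys again stand in for partition
values; the real logic lives in the linked PR.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper illustrating the release logic from the proposal.
    class PartitionRefTracker {
      // reference count per partition value
      private final Map<String, Integer> refCounts = new HashMap<>();

      // called while building the ManifestGroup, once per data manifest
      // that references the partition
      void retain(String partition) {
        refCounts.merge(partition, 1, Integer::sum);
      }

      // called as each data manifest finishes planning; returns true once
      // the partition is no longer referenced by any remaining manifest
      boolean release(String partition) {
        Integer remaining = refCounts.computeIfPresent(partition, (p, c) -> c - 1);
        if (remaining != null && remaining == 0) {
          refCounts.remove(partition);
          return true;
        }
        return false;
      }
    }

When release returns true, the caller would use that partition value to drop
the corresponding entries from DeleteFileIndex, matching the last step of the
proposal.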
