Hi all,

I just found out that purging tables can run into OOMs. Even when no OOM occurs, the purge puts a huge amount of pressure on the Java heap, which I think deserves broader attention.
TL;DR: all manifest files of the table are materialized on the Java heap at once, as base64-encoded strings (~ +33% in size). Each manifest file can become really big, multiple MB of binary data per file.

The logic to purge Iceberg tables works roughly as follows. The input parameter is the pointer to a table's metadata.

1. Read the table metadata
2. Read the manifest lists of all snapshots
3. Read all manifest files of all manifest lists
4. Create a new task entity for each manifest file, containing the base64-encoded serialized manifest file

The base64-encoded full binary Iceberg manifest files [1] are included in the task entities, which are consumed here [2].

Although some Java stream handling is used, all manifest files of the table to purge are materialized on the Java heap at once [3].

Even worse, the base64-encoded data is added to a JSON-serialized object, which is added to a properties bag, which is in turn JSON-serialized. Since JSON re-serializations of that properties bag are "normal", the total heap pressure is bigger than the sum of all base64-serialized manifest files.

The good part is that the Polaris service can safely start up again and resume normal operations, as it does not attempt to resume unfinished tasks at all.

There is a corresponding GitHub issue for this [4].

Robert

[1] https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java#L194
[2] https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/ManifestFileCleanupTaskHandler.java#L67
[3] https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java#L111-L130
[4] https://github.com/apache/polaris/issues/2365
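
P.S. To make the size amplification described above a bit more concrete, here is a minimal, self-contained Java sketch. It is not the Polaris code: the class name, the field names `manifestBase64` and `taskData`, the 8 MiB manifest size, and the hand-rolled JSON escaping are all assumptions made for illustration. It only mimics the shape of the problem, namely raw bytes, then a base64 string, then a JSON object, then a properties bag that carries that JSON as a string value.

```java
import java.util.Base64;

public class ManifestHeapSketch {

  /** Length of the padded Base64 encoding of n bytes: 4 * ceil(n / 3). */
  public static long base64Length(long n) {
    return 4 * ((n + 2) / 3);
  }

  /** Minimal JSON string escaping for this illustration (quotes and backslashes only). */
  public static String jsonString(String s) {
    StringBuilder sb = new StringBuilder(s.length() + 2).append('"');
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '"' || c == '\\') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.append('"').toString();
  }

  /** Builds a one-field JSON object whose value is a JSON string. */
  public static String singleFieldObject(String field, String value) {
    return "{" + jsonString(field) + ":" + jsonString(value) + "}";
  }

  public static void main(String[] args) {
    // Hypothetical 8 MiB manifest file; real sizes vary per table.
    byte[] manifest = new byte[8 * 1024 * 1024];

    // Step 4 of the purge logic: the manifest is Base64-encoded into a String.
    String encoded = Base64.getEncoder().encodeToString(manifest);

    // The encoded data is embedded in a JSON-serialized task object
    // (hypothetical field name) ...
    String taskJson = singleFieldObject("manifestBase64", encoded);
    // ... which is stored as a string value in a properties bag
    // (hypothetical key) that is JSON-serialized again.
    String propertiesJson = singleFieldObject("taskData", taskJson);

    System.out.println("raw bytes:             " + manifest.length);
    System.out.println("base64 chars:          " + encoded.length());
    System.out.println("task JSON chars:       " + taskJson.length());
    System.out.println("properties JSON chars: " + propertiesJson.length());
  }
}
```

Running it prints four numbers. In this sketch each of them belongs to a separate full-size object that is alive at the same time, and this is for a single manifest file only. The numbers are character counts, not exact heap bytes, so treat them as a lower bound on the shape of the problem rather than a measurement of Polaris itself.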