Hi all,

I just figured out that there is a chance of running into OOMs when
purging tables. Even if an OOM does not occur, there is a huge amount
of heap pressure that deserves broader attention.

TL;DR: all manifest files of a table are materialized on the Java heap
at once as base64-encoded strings (roughly +33% over the binary size).
Each manifest file can become really big, several MB of binary data.
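
To make the ~33% figure concrete, here is a minimal sketch (not Polaris
code). The 8 MiB manifest size is an assumption picked for illustration,
and the class and method names are made up. It relies only on the fact
that padded base64 emits 4 characters for every 3 input bytes.

```java
import java.util.Base64;

class Base64Overhead {
    // Length of padded base64 text for rawBytes input bytes: 4 * ceil(rawBytes / 3).
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    public static void main(String[] args) {
        // Hypothetical manifest size; real manifests vary.
        byte[] manifest = new byte[8 * 1024 * 1024];
        String encoded = Base64.getEncoder().encodeToString(manifest);

        System.out.println("raw bytes:    " + manifest.length);
        System.out.println("base64 chars: " + encoded.length());
        System.out.println("formula:      " + encodedLength(manifest.length));
        System.out.println("ratio:        " + (double) encoded.length() / manifest.length);
    }
}
```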

The logic to purge Iceberg tables roughly works as follows. The input
parameter is a pointer to the table metadata.
1. Read table metadata
2. Read manifest lists of all snapshots
3. Read all manifest files of all manifest lists
4. Create a new task entity for each manifest file, containing the
base-64 encoded serialized manifest-file

The base64 encoded full binary Iceberg manifest files [1] are included
in the task entities, which are consumed here [2].

Although some Java stream handling is being used, all manifest-files
of the table to purge are materialized at once on the Java heap [3].
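
The general effect can be sketched as follows. This is not the Polaris
code: readEncodedManifest is a hypothetical stand-in for reading and
encoding one manifest file, and the counts and sizes are arbitrary. It
only shows that a stream pipeline ending in a collect keeps every
element reachable at once, while handing each element to a consumer
keeps one element reachable at a time.

```java
import java.util.Base64;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class ManifestStreaming {
    // Hypothetical stand-in for reading one manifest file and base64-encoding it.
    static String readEncodedManifest(int index, int sizeBytes) {
        return Base64.getEncoder().encodeToString(new byte[sizeBytes]);
    }

    // Lazy stream: nothing is read until a terminal operation runs.
    static Stream<String> encodedManifests(int count, int sizeBytes) {
        return IntStream.range(0, count).mapToObj(i -> readEncodedManifest(i, sizeBytes));
    }

    // Eager: collect() holds every encoded manifest in the list at the same time.
    static long charsHeldAtOnceWhenCollected(int count, int sizeBytes) {
        List<String> all = encodedManifests(count, sizeBytes).collect(Collectors.toList());
        return all.stream().mapToLong(String::length).sum();
    }

    // Incremental: each manifest is passed to the sink and is unreachable afterwards,
    // so at most one encoded manifest is held by this pipeline at any moment.
    static long maxCharsHeldAtOnceWhenStreamed(int count, int sizeBytes, Consumer<String> sink) {
        long[] max = {0};
        encodedManifests(count, sizeBytes).forEach(s -> {
            max[0] = Math.max(max[0], s.length());
            sink.accept(s);
        });
        return max[0];
    }

    public static void main(String[] args) {
        System.out.println("collected: " + charsHeldAtOnceWhenCollected(100, 1024 * 1024));
        System.out.println("streamed:  " + maxCharsHeldAtOnceWhenStreamed(100, 1024 * 1024, s -> { }));
    }
}
```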

Even worse, the base64 encoded data is added to a JSON serialized
object, which is added to a properties bag, which is in turn JSON
serialized. Since that properties bag is routinely re-serialized to
JSON, every serialization creates yet another copy of the data, so the
total heap pressure is bigger than the sum of all base64 serialized
manifest files.
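
A rough sketch of why the nesting adds up, using plain string handling
instead of a JSON library. The property names "manifest" and "data" and
the helper methods are hypothetical and not taken from Polaris. Each
layer produces a new string that contains the whole base64 payload
again, plus quoting and escaping.

```java
class NestedJsonCopies {
    // Minimal JSON string quoting: escapes only quotes and backslashes (sketch only).
    static String quote(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 2).append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    // Inner layer: a JSON object carrying the base64 manifest (hypothetical field name).
    static String taskJson(String base64Manifest) {
        return "{" + quote("manifest") + ":" + quote(base64Manifest) + "}";
    }

    // Outer layer: the inner JSON embedded as a string value in a properties bag.
    static String propertiesJson(String taskJson) {
        return "{" + quote("data") + ":" + quote(taskJson) + "}";
    }

    public static void main(String[] args) {
        String base64 = "QUJD"; // base64 of "ABC", stands in for a multi-MB payload
        String task = taskJson(base64);
        String props = propertiesJson(task);
        // While all three references are live, the payload exists three times on the heap.
        System.out.println(base64.length() + " -> " + task.length() + " -> " + props.length());
    }
}
```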

The good part is that the Polaris service can safely start up again
and resume normal operations, as it does not attempt to resume
unfinished tasks at all.

There is a corresponding GitHub issue for this [4].

Robert


[1] https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java#L194
[2] https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/ManifestFileCleanupTaskHandler.java#L67
[3] https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java#L111-L130
[4] https://github.com/apache/polaris/issues/2365
