Nico Kruber created FLINK-25027:
-----------------------------------
Summary: Allow GC of a finished job's JobMaster before the slot
timeout is reached
Key: FLINK-25027
URL: https://issues.apache.org/jira/browse/FLINK-25027
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.13.3, 1.12.5, 1.14.0
Reporter: Nico Kruber
Attachments: image-2021-11-23-20-32-20-479.png
In a session cluster, after a (batch) job has finished, its JobMaster seems to
stick around for another couple of minutes before becoming eligible for garbage
collection.
Looking into a heap dump, the JobMaster seems to be held by a
{{PhysicalSlotRequestBulkCheckerImpl}} that is enqueued in the underlying Akka
executor (and keeps the JM from being GC’d). By default, the action is
scheduled after {{slot.request.timeout}}, which defaults to 5 min (thanks
[~trohrmann] for helping out here).
!image-2021-11-23-20-32-20-479.png!
With this default, you have to provision enough metaspace for all jobs that
finish within any 5-minute window, which may be a couple of jobs, needlessly!
The problem seems to be that Flink schedules this check on the main thread
executor, which is backed by the {{ActorSystem}}'s scheduler, and a task
scheduled with Akka can (probably) not be easily cancelled.
One idea could be to use a dedicated thread pool per JM that we shut down when
the JM terminates. That way, pending tasks would no longer keep the JM from
being GC’d.
(The concrete example we investigated was a DataSet job)
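
To illustrate the dedicated-thread-pool idea, here is a minimal sketch in plain
Java. It is not Flink code: {{PerJobTimerService}}, {{scheduleTimeoutCheck}} and
{{shutdownAndDrop}} are hypothetical names made up for this example, and it
assumes the timeout check can be expressed as a plain {{Runnable}}. The point is
only that shutting down a JM-owned scheduler drops its pending delayed tasks, so
they stop referencing the JM right away instead of after the timeout.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical per-JobMaster timer service (illustration only, not Flink's API).
 * Each JobMaster would own one instance and close it on termination.
 */
final class PerJobTimerService implements AutoCloseable {

    // Dedicated single-threaded scheduler; a daemon thread so it never blocks JVM exit.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "per-job-timer");
                t.setDaemon(true);
                return t;
            });

    /** Schedules a delayed check, e.g. a slot request timeout check. */
    ScheduledFuture<?> scheduleTimeoutCheck(Runnable check, long delayMillis) {
        return timer.schedule(check, delayMillis, TimeUnit.MILLISECONDS);
    }

    /**
     * Shuts the scheduler down and removes all tasks that have not run yet.
     * Returns how many pending tasks were dropped.
     */
    int shutdownAndDrop() {
        List<Runnable> pending = timer.shutdownNow();
        return pending.size();
    }

    boolean isShutdown() {
        return timer.isShutdown();
    }

    @Override
    public void close() {
        shutdownAndDrop();
    }
}
```

A JM would call {{scheduleTimeoutCheck(check, 300_000L)}} while running and
{{close()}} when it terminates. With a shared, long-lived scheduler, the same
pending task would stay reachable until its delay elapses.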