[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

Shammon (Jira) Mon, 29 Nov 2021 04:30:12 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450409#comment-17450409
 ]


Shammon commented on FLINK-25027:
---------------------------------

[~guoyangze] Thank you for your advice. A unified solution for the periodic 
tasks sounds good to me. ResourceManager and TaskManager also have periodic 
tasks, which are scheduled by a timer or by the Akka thread pool. I will add a 
general scheduled thread pool to RpcEndpoint and use it to schedule these 
periodic tasks. Thanks.
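
A minimal sketch of the idea of an endpoint-owned scheduled thread pool, using 
only the JDK. The class EndpointWithOwnScheduler and its methods are 
hypothetical names for illustration and are not Flink's RpcEndpoint API. The 
sketch also ignores that Flink runs endpoint logic in the main thread, so a 
real change would presumably still hand the task body back to the main thread 
executor.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an RPC endpoint that owns its scheduler. */
class EndpointWithOwnScheduler {

    // One scheduler per endpoint, so its task queue dies with the endpoint.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "endpoint-scheduler");
                t.setDaemon(true);
                return t;
            });

    /** Schedules a one-shot task, e.g. a timeout check. */
    ScheduledFuture<?> scheduleOnce(Runnable task, long delay, TimeUnit unit) {
        return scheduler.schedule(task, delay, unit);
    }

    /** Schedules a repeating task, e.g. a heartbeat or cleanup. */
    ScheduledFuture<?> schedulePeriodic(Runnable task, long period, TimeUnit unit) {
        return scheduler.scheduleWithFixedDelay(task, period, period, unit);
    }

    /**
     * Called when the endpoint terminates. shutdownNow() drains the queue,
     * so pending tasks no longer hold references to the endpoint.
     *
     * @return the number of tasks that were still waiting to run
     */
    int stop() {
        List<Runnable> dropped = scheduler.shutdownNow();
        return dropped.size();
    }
}
```

With this shape, a check scheduled five minutes ahead is dropped as soon as 
stop() is called, instead of staying queued in a shared scheduler until its 
delay elapses.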

> Allow GC of a finished job's JobMaster before the slot timeout is reached
> -------------------------------------------------------------------------
>
>                 Key: FLINK-25027
>                 URL: https://issues.apache.org/jira/browse/FLINK-25027
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.12.5, 1.13.3
>            Reporter: Nico Kruber
>            Assignee: Shammon
>            Priority: Major
>             Fix For: 1.15.0, 1.14.1, 1.13.4
>
>         Attachments: image-2021-11-23-20-32-20-479.png
>
>
> In a session cluster, after a (batch) job is finished, the JobMaster seems to 
> stick around for another couple of minutes before being eligible for garbage 
> collection.
> Looking into a heap dump, it seems to be tied to a 
> {{PhysicalSlotRequestBulkCheckerImpl}} which is enqueued in the underlying 
> Akka executor (and keeps the JM from being GC’d). By default, the action is 
> scheduled for {{slot.request.timeout}}, which defaults to 5 min (thanks 
> [~trohrmann] for helping out here).
> !image-2021-11-23-20-32-20-479.png!
> With this setting, you have to provision enough metaspace to cover 5 
> minutes, which may needlessly span a couple of jobs.
> The problem seems to be that Flink schedules the action via the main thread 
> executor, which uses the {{ActorSystem}}'s scheduler, and a future task 
> scheduled with Akka can (probably) not be easily cancelled.
> One idea could be to use a dedicated thread pool per JM, that we shut down 
> when the JM terminates. That way we would not keep the JM from being GC’d.
> (The concrete example we investigated was a DataSet job)
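
As a rough analogy for the retention described above, the sketch below shows 
how a delayed task can stay queued after it is cancelled. It uses the JDK's 
ScheduledThreadPoolExecutor, not Akka's scheduler, so it only illustrates the 
general effect. RetentionDemo and queuedAfterCancel are hypothetical names.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: does a cancelled delayed task stay in the queue? */
class RetentionDemo {

    /** Returns the executor's queue size right after cancelling a 5 min task. */
    static int queuedAfterCancel(boolean removeOnCancel) {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        // The JDK default is false: cancelled tasks stay queued until their
        // delay elapses, together with everything the task references.
        pool.setRemoveOnCancelPolicy(removeOnCancel);
        try {
            ScheduledFuture<?> timeoutCheck =
                    pool.schedule(() -> {}, 5, TimeUnit.MINUTES);
            timeoutCheck.cancel(false);
            return pool.getQueue().size();
        } finally {
            // Shutting the pool down drops whatever is still queued.
            pool.shutdownNow();
        }
    }
}
```

Here queuedAfterCancel(false) returns 1 and queuedAfterCancel(true) returns 0. 
Either removing tasks on cancel or shutting down a per-JM pool releases the 
queued task before the timeout would have fired.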



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
