[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

Shammon (Jira) Thu, 02 Dec 2021 18:12:12 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452693#comment-17452693
 ]


Shammon commented on FLINK-25027:
---------------------------------

[~guoyangze#1] I think it's a good question. The main problem of fullgc in JM 
is that it will increase the delay of small batch jobs in the session cluster. 
As you mentioned, too many periodic tasks can't be recycled in time which will 
lead to fullgc in JM. When i submit small batch jobs to flink session cluster 
as olap queries, the fullgcs in JM will increase latencies and decrease qps of 
these jobs, i think it's a very import issue

> Allow GC of a finished job's JobMaster before the slot timeout is reached
> -------------------------------------------------------------------------
>
>                 Key: FLINK-25027
>                 URL: https://issues.apache.org/jira/browse/FLINK-25027
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.12.5, 1.13.3
>            Reporter: Nico Kruber
>            Assignee: Shammon
>            Priority: Major
>             Fix For: 1.15.0, 1.14.1, 1.13.4
>
>         Attachments: image-2021-11-23-20-32-20-479.png
>
>
> In a session cluster, after a (batch) job is finished, the JobMaster seems to 
> stick around for another couple of minutes before being eligible for garbage 
> collection.
> Looking into a heap dump, it seems to be tied to a 
> {{PhysicalSlotRequestBulkCheckerImpl}} which is enqueued in the underlying 
> Akka executor (and keeps the JM from being GC’d). Per default the action is 
> scheduled for {{slot.request.timeout}} that defaults to 5 min (thanks 
> [~trohrmann] for helping out here)
> !image-2021-11-23-20-32-20-479.png!
> With this setting, you will have to account for enough metaspace to cover 5 
> minutes of time which may span a couple of jobs, needlessly!
> The problem seems to be that Flink is using the main thread executor for the 
> scheduling that uses the {{ActorSystem}}'s scheduler and the future task 
> scheduled with Akka can (probably) not be easily cancelled.
> One idea could be to use a dedicated thread pool per JM, that we shut down 
> when the JM terminates. That way we would not keep the JM from being GC’d.
> (The concrete example we investigated was a DataSet job)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

Reply via email to