[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448365#comment-17448365
 ] 

FYung commented on FLINK-25027:
-------------------------------

Hi [~nkruber]
Thank you for creating this issue. I'm so glad to see it because we have the 
same problem about akka thread pool. 

We submit batch job to flink session cluster and find there're too many task in 
akka thread pool which will cause GC. Besides 
PhysicalSlotRequestBulkCheckerImpl, we also find tasks of `heartbeat`, timeout 
checker of `pending request`/`checkIdleSlotTimeout`/`checkBatchSlotTimeout` in 
DeclarativeSlotPoolBridge, timeout checker in 
`DefaultScheduler.registerProducedPartitions` and more. Akka thread pool will 
hold these instances even the jobs are finished.

Using a dedicated thread pool per JM to manage these tasks is a good idea. 
Maybe we should create a thread pool in JM at first, then create more subtasks 
in this issue to move tasks above from akka thread pool to the thread pool in 
JM. What do you think?  Hope to hear from you [~nkruber][~trohrmann] THX :)

> Allow GC of a finished job's JobMaster before the slot timeout is reached
> -------------------------------------------------------------------------
>
>                 Key: FLINK-25027
>                 URL: https://issues.apache.org/jira/browse/FLINK-25027
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.12.5, 1.13.3
>            Reporter: Nico Kruber
>            Priority: Major
>         Attachments: image-2021-11-23-20-32-20-479.png
>
>
> In a session cluster, after a (batch) job is finished, the JobMaster seems to 
> stick around for another couple of minutes before being eligible for garbage 
> collection.
> Looking into a heap dump, it seems to be tied to a 
> {{PhysicalSlotRequestBulkCheckerImpl}} which is enqueued in the underlying 
> Akka executor (and keeps the JM from being GC’d). Per default the action is 
> scheduled for {{slot.request.timeout}} that defaults to 5 min (thanks 
> [~trohrmann] for helping out here)
> !image-2021-11-23-20-32-20-479.png!
> With this setting, you will have to account for enough metaspace to cover 5 
> minutes of time which may span a couple of jobs, needlessly!
> The problem seems to be that Flink is using the main thread executor for the 
> scheduling that uses the {{ActorSystem}}'s scheduler and the future task 
> scheduled with Akka can (probably) not be easily cancelled.
> One idea could be to use a dedicated thread pool per JM, that we shut down 
> when the JM terminates. That way we would not keep the JM from being GC’d.
> (The concrete example we investigated was a DataSet job)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to