[ https://issues.apache.org/jira/browse/FLINK-24377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-24377:
-----------------------------------
Labels: pull-request-available (was: )
> TM resource may not be properly released after heartbeat timeout
> ----------------------------------------------------------------
>
> Key: FLINK-24377
> URL: https://issues.apache.org/jira/browse/FLINK-24377
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Deployment / YARN, Runtime /
> Coordination
> Affects Versions: 1.14.0, 1.13.2
> Reporter: Xintong Song
> Assignee: Xintong Song
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.14.0, 1.13.3, 1.15.0
>
>
> In native K8s and YARN deployment modes, the RM disconnects a TM when its
> heartbeat times out. However, it does not actively release the pod /
> container of that TM. Releasing the pod / container relies on the TM
> terminating itself after it fails to re-register with the RM.
> In some rare conditions, the TM process may not terminate and may hang for a
> long time. In such cases, K8s / YARN sees the process still running and thus
> will not release the pod / container. Neither will Flink's resource manager.
> Consequently, the resource is leaked until the entire application is
> terminated.
> To fix this, we should make {{ActiveResourceManager}} actively release the
> resource to K8s / YARN after a TM heartbeat timeout.
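
The proposed behaviour can be illustrated with a minimal sketch. It is not
Flink's actual code: ClusterDriver, RecordingDriver, SketchResourceManager and
all method names are hypothetical stand-ins, and the real
ActiveResourceManager is considerably more involved. The sketch only shows the
idea that a heartbeat timeout should trigger both the disconnect and an
explicit release request to K8s / YARN, instead of waiting for the TM to
terminate itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the K8s / YARN client; not Flink's real interface. */
interface ClusterDriver {
    void releaseResource(String resourceId);
}

/** Driver that only records which resources were released, for demonstration. */
class RecordingDriver implements ClusterDriver {
    final List<String> released = new ArrayList<>();

    @Override
    public void releaseResource(String resourceId) {
        released.add(resourceId);
    }
}

/** Simplified model of a resource manager that owns the pods / containers it requested. */
class SketchResourceManager {
    private final ClusterDriver driver;
    // Resource ID -> address of the registered TM (hypothetical bookkeeping).
    private final Map<String, String> registeredTaskManagers = new HashMap<>();

    SketchResourceManager(ClusterDriver driver) {
        this.driver = driver;
    }

    void registerTaskManager(String resourceId, String address) {
        registeredTaskManagers.put(resourceId, address);
    }

    boolean isRegistered(String resourceId) {
        return registeredTaskManagers.containsKey(resourceId);
    }

    /** Called when the heartbeat of the given TM times out. */
    void onHeartbeatTimeout(String resourceId) {
        // Step 1 (behaviour described in the report): disconnect the TM.
        boolean wasRegistered = registeredTaskManagers.remove(resourceId) != null;

        // Step 2 (proposed fix): ask K8s / YARN to release the pod / container,
        // whether or not the TM process terminates on its own.
        if (wasRegistered) {
            driver.releaseResource(resourceId);
        }
    }
}
```

In this sketch a second timeout for the same resource ID is a no-op, so the
release request is issued at most once per registered TM.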
--
This message was sent by Atlassian Jira
(v8.3.4#803005)