Xintong Song created FLINK-24377:
------------------------------------

             Summary: TM resource may not be properly released after heartbeat 
timeout
                 Key: FLINK-24377
                 URL: https://issues.apache.org/jira/browse/FLINK-24377
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
Coordination
    Affects Versions: 1.13.2, 1.14.0
            Reporter: Xintong Song
            Assignee: Xintong Song
             Fix For: 1.14.0, 1.13.3, 1.15.0


In native K8s and Yarn deployment modes, the RM disconnects a TM when its 
heartbeat times out. However, it does not actively release the pod / container 
of that TM. Releasing the pod / container relies on the TM terminating itself 
after failing to re-register with the RM.

In some rare conditions, the TM process may not terminate and may instead hang 
for a long time. In such cases, K8s / Yarn sees the process still running and 
thus will not release the pod / container. Neither will Flink's resource 
manager. Consequently, the resource is leaked until the entire application is 
terminated.

To fix this, we should make {{ActiveResourceManager}} actively release the 
resource to K8s / Yarn after a TM heartbeat timeout.
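
The proposed behavior could be sketched roughly as follows. This is only a 
simplified illustration, not Flink's actual code: the names 
ResourceDriverSketch, RecordingDriver, ActiveResourceManagerSketch and 
onHeartbeatTimeout are hypothetical, and worker IDs are modeled as plain 
strings. The point is that the timeout handler both disconnects the TM and asks 
the external resource framework to release the pod / container, rather than 
waiting for the TM to exit on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the K8s / Yarn integration layer (an assumption,
// not a Flink interface).
interface ResourceDriverSketch {
    void releaseResource(String workerId);
}

// Records release requests instead of talking to a real cluster.
class RecordingDriver implements ResourceDriverSketch {
    final List<String> released = new ArrayList<>();

    @Override
    public void releaseResource(String workerId) {
        released.add(workerId);
    }
}

// Simplified model of a resource manager that owns the TM pods / containers.
class ActiveResourceManagerSketch {
    private final ResourceDriverSketch driver;
    private final Map<String, String> registeredWorkers = new HashMap<>();

    ActiveResourceManagerSketch(ResourceDriverSketch driver) {
        this.driver = driver;
    }

    void registerTaskManager(String workerId, String address) {
        registeredWorkers.put(workerId, address);
    }

    boolean isRegistered(String workerId) {
        return registeredWorkers.containsKey(workerId);
    }

    // Invoked when the heartbeat of a TM times out.
    void onHeartbeatTimeout(String workerId) {
        // Existing behavior: disconnect (forget) the TM.
        String address = registeredWorkers.remove(workerId);
        if (address == null) {
            return; // unknown TM, or the timeout was already handled
        }
        // Proposed behavior: release the pod / container explicitly, so a
        // hanging TM process cannot keep the resource alive.
        driver.releaseResource(workerId);
    }
}

class HeartbeatTimeoutDemo {
    public static void main(String[] args) {
        RecordingDriver driver = new RecordingDriver();
        ActiveResourceManagerSketch rm = new ActiveResourceManagerSketch(driver);
        rm.registerTaskManager("tm-1", "host-a:6122");
        rm.onHeartbeatTimeout("tm-1");
        System.out.println("released: " + driver.released);
    }
}
```

In this sketch a repeated timeout for the same TM is a no-op, so the resource 
is released at most once per registration.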



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
