[
https://issues.apache.org/jira/browse/FLINK-28499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566629#comment-17566629
]
Yang Wang commented on FLINK-28499:
-----------------------------------
BTW, if the JobManager pod had not kept crashing and backing off, we would not
have so many residual TaskManager pods in init error state, because the Flink
ResourceManager releases any TM that does not register within the timeout.
> resource leak when job failed with unknown status In Application Mode
> ---------------------------------------------------------------------
>
> Key: FLINK-28499
> URL: https://issues.apache.org/jira/browse/FLINK-28499
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.1
> Reporter: lihe ma
> Priority: Major
> Attachments: cluster-pod-error.png
>
>
> I found a job that had restarted thousands of times, and the jobmanager tried
> to create a new taskmanager pod every time. The jobmanager restarted because
> the job was submitted with a duplicate job id[1] (we preset the jobId rather
> than generating it), but unfortunately I did not save the logs.
> This job requires one taskmanager pod under normal circumstances, but
> thousands of pods were eventually leaked. You can find the screenshot in the
> attachment.
>
> In application mode, cluster resources are released when the job finishes in
> succeeded, failed or canceled status[2][3]. When an exception happens, the
> job may terminate in unknown status[4].
> In this case, the job exited with unknown status without releasing its
> taskmanager pods. So is it reasonable not to release the taskmanager pods when
> the job exits in unknown status?
>
>
> One line from the original logs:
> 2022-07-01 09:45:40,712 [main] INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster
> entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.
>
> [1]
> [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]
> [2]
> [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]
> [3]
> [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]
> [4]
> [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]
>
>
>
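For readers following the quoted report, the behaviour it describes can be sketched roughly as below. This is only an illustration under stated assumptions: the `Status` enum and the `releasesResources` helper are hypothetical names made up for this sketch and are not Flink's actual `ApplicationStatus` or `ApplicationDispatcherBootstrap` code. It encodes only what the report states, namely that resources are released for succeeded, failed and canceled, but not for unknown.

```java
import java.util.EnumSet;
import java.util.Set;

public class ReleaseOnTerminalStatusSketch {

    /** Hypothetical stand-in for the four statuses named in the report (not Flink's enum). */
    enum Status { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    /** Statuses for which, according to the report, cluster resources are released. */
    static final Set<Status> RELEASING =
            EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.CANCELED);

    /** Returns true if a job ending in the given status would trigger resource release. */
    static boolean releasesResources(Status status) {
        return RELEASING.contains(status);
    }

    public static void main(String[] args) {
        // Print the decision for every status; UNKNOWN is the case the report asks about.
        for (Status status : Status.values()) {
            System.out.println(status + " -> release=" + releasesResources(status));
        }
    }
}
```

Under this model, a job that ends in unknown status leaves its taskmanager pods in place, which matches the leak described in the report.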