[jira] [Commented] (FLINK-28499) resource leak when job failed with unknown status In Application Mode

lihe ma (Jira) Wed, 13 Jul 2022 01:58:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-28499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566225#comment-17566225
 ]


lihe ma commented on FLINK-28499:
---------------------------------

[~wangyang0918]  Thanks, I appreciate your comment!
The extra taskmanager pods are deleted when we kill the jobmanager 
deployment. In this case, however, the jobmanager looped: it tried to submit 
the job -> created a new taskmanager -> the job exited in unknown status -> 
the jobmanager restarted -> it tried to submit the job and create a 
taskmanager again. The loop did not end until we found the error and killed 
the jobmanager. When a job fails, I would expect to see several pods in 
restarting or error status rather than thousands of pods.

I am not sure how much impact these leftover pods have on the cluster. Maybe 
we can release these pods when the job exits in unknown status to avoid this? [1]

[1]https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L170-L171
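
The loop described above can be sketched as a toy simulation. This is only an 
illustration under stated assumptions: Status, Cluster and simulate are 
hypothetical names that do not exist in Flink, and the sketch does not 
reproduce ApplicationDispatcherBootstrap. It only shows why one pod is left 
behind per restart when nothing is released on unknown status, and how 
releasing on unknown status would stop the growth.

```java
import java.util.ArrayList;
import java.util.List;

public class UnknownStatusLeakSketch {

    /** Hypothetical stand-in for the terminal states named in this issue; not Flink's enum. */
    enum Status { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    /** Hypothetical stand-in for the taskmanager pods that outlive a jobmanager restart. */
    static final class Cluster {
        private final List<String> pods = new ArrayList<>();
        private int counter = 0;

        void requestTaskManager() { pods.add("taskmanager-" + (++counter)); }
        void releaseAll() { pods.clear(); }
        int podCount() { return pods.size(); }
    }

    /**
     * Simulates the restart loop: every attempt requests one taskmanager and then
     * the job exits in UNKNOWN status (assumed here to model the duplicate job id case).
     */
    public static int simulate(int restarts, boolean releaseOnUnknown) {
        Cluster cluster = new Cluster();
        for (int attempt = 0; attempt < restarts; attempt++) {
            cluster.requestTaskManager();
            Status status = Status.UNKNOWN;
            // Resources are released for known terminal states; for UNKNOWN only if enabled.
            boolean release = status != Status.UNKNOWN || releaseOnUnknown;
            if (release) {
                cluster.releaseAll();
            }
        }
        return cluster.podCount();
    }

    public static void main(String[] args) {
        System.out.println(simulate(1000, false)); // prints 1000: one leaked pod per restart
        System.out.println(simulate(1000, true));  // prints 0: pods released on unknown status
    }
}
```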

 

> resource leak when job failed with unknown status In Application Mode
> ---------------------------------------------------------------------
>
>                 Key: FLINK-28499
>                 URL: https://issues.apache.org/jira/browse/FLINK-28499
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.1
>            Reporter: lihe ma
>            Priority: Major
>         Attachments: cluster-pod-error.png
>
>
> I found a job that had restarted thousands of times, and the jobmanager 
> tried to create a new taskmanager pod every time. The jobmanager restarted 
> because the job was submitted with a duplicate job id [1] (we preset the 
> jobId rather than generating it), but unfortunately I had not saved the logs. 
> This job requires one taskmanager pod under normal circumstances, but 
> thousands of pods were leaked in the end. You can find the screenshot in the 
> attachment.
>  
> In application mode, cluster resources are released when the job finishes in 
> succeeded, failed or canceled status [2][3]. When an exception happens, the 
> job may terminate in unknown status [4]. 
> In this case, the job exited with unknown status without releasing the 
> taskmanager pods. So is it reasonable not to release the taskmanager pods 
> when the job exits in unknown status? 
>  
>  
> One line from the original logs:
> 2022-07-01 09:45:40,712 [main] INFO 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster 
> entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.
>  
> [1] 
> [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]
> [2] 
> [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]
> [3] 
> [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]
> [4] 
> [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
