lihe ma created FLINK-28499:
-------------------------------

             Summary: resource leak when job failed with unknown status In 
Application Mode
                 Key: FLINK-28499
                 URL: https://issues.apache.org/jira/browse/FLINK-28499
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.13.1
            Reporter: lihe ma
         Attachments: cluster-pod-error.png

I found a job that had restarted thousands of times, and the jobmanager tried 
to create a new taskmanager pod on every restart. The jobmanager restarted 
because the job was submitted with a duplicate job id[1] (we preset the jobId 
rather than generating it), but unfortunately I did not save the logs.

This job requires one taskmanager pod under normal circumstances, but thousands 
of pods were eventually leaked.
!image-2022-07-12-11-02-43-009.png|width=666,height=366!



In application mode, cluster resources are released when the job finishes in 
succeeded, failed or canceled status[2][3]. When an exception occurs, the 
job may terminate in unknown status[4].

In this case, the job exited with unknown status without releasing the 
taskmanager pods. So is it reasonable not to release the taskmanagers when a 
job exits in unknown status?
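
The behaviour described above can be illustrated with a small, self-contained sketch. It is not Flink's code: the class ShutdownSketch, the enum Status and the method releasesClusterResources are hypothetical names, and the rule they encode is only the one stated in this report (release on succeeded, failed or canceled; nothing on unknown).

```java
import java.util.EnumSet;
import java.util.Set;

public class ShutdownSketch {

    /** Hypothetical stand-in for Flink's ApplicationStatus; not the real class. */
    enum Status { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    /** Statuses for which, per the description above, cluster resources are released. */
    private static final Set<Status> RELEASING =
            EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.CANCELED);

    /** Returns true if a job ending in the given status would trigger resource cleanup. */
    static boolean releasesClusterResources(Status status) {
        return RELEASING.contains(status);
    }

    public static void main(String[] args) {
        // Print the cleanup decision for every status; UNKNOWN is the one that leaks.
        for (Status status : Status.values()) {
            System.out.println(status + " -> release=" + releasesClusterResources(status));
        }
    }
}
```

Under this assumed rule, a job that ends in UNKNOWN falls through without cleanup, which would match the leaked taskmanager pods observed here.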

 

 

One line from the original logs:
2022-07-01 09:45:40,712 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.
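
To connect the exit code in this log line to a status, here is a minimal lookup sketch. The class ExitCodeSketch and the method statusForExitCode are hypothetical, and the code-to-status values are an assumption based on my reading of ApplicationStatus[4]; they should be checked against the Flink version actually in use.

```java
import java.util.Map;
import java.util.Optional;

public class ExitCodeSketch {

    // Assumed mapping of process exit codes to status names (a subset, to be
    // verified against ApplicationStatus in the Flink version being run).
    private static final Map<Integer, String> KNOWN_CODES =
            Map.of(0, "SUCCEEDED", 1443, "FAILED", 1445, "UNKNOWN");

    /** Returns the assumed status name for an exit code, or empty if it is not in the table. */
    static Optional<String> statusForExitCode(int exitCode) {
        return Optional.ofNullable(KNOWN_CODES.get(exitCode));
    }

    public static void main(String[] args) {
        // Exit code taken from the log line above.
        System.out.println(statusForExitCode(1445).orElse("unmapped"));
    }
}
```

If the assumed mapping holds, exit code 1445 indicates that the entrypoint terminated in unknown status, consistent with the description above.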

 

[1] 
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]

[2] 
[https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]


[3] 
[https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]

[4] 
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
