[ 
https://issues.apache.org/jira/browse/FLINK-19171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193481#comment-17193481
 ] 

Xintong Song commented on FLINK-19171:
--------------------------------------

[~yittg],

Thanks for sharing the details on your case.

Correct me if I'm wrong, but it seems to me that the real problem in your case 
is that Flink does not remove the pending pods it no longer needs when the job 
is canceled. If the pending pods were properly removed, you would not need to 
delete them manually, and later jobs would not be affected.
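
To illustrate the cleanup described above, here is a minimal sketch of a pod 
ledger that releases pending pods once they are no longer needed. All names 
here (PendingPodLedger, Phase, onPodGone, releasePendingPods) are hypothetical 
and are not taken from Flink's actual resource manager code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bookkeeping for worker pods; not Flink's actual implementation. */
class PendingPodLedger {

    /** Hypothetical pod phases tracked by this sketch. */
    enum Phase { PENDING, RUNNING }

    // Pod name -> phase, for every pod that was requested and not yet released.
    private final Map<String, Phase> pods = new LinkedHashMap<>();

    /** Records a newly requested pod as pending. */
    void onPodRequested(String podName) {
        pods.put(podName, Phase.PENDING);
    }

    /** Marks a known pod as running; unknown pod names are ignored. */
    void onPodRunning(String podName) {
        pods.replace(podName, Phase.RUNNING);
    }

    /** Forgets a pod that was terminated or deleted, whatever phase it was in. */
    void onPodGone(String podName) {
        pods.remove(podName);
    }

    /**
     * Drops all pending pods from the ledger, e.g. after the job is canceled.
     * Returns their names so the caller can delete them from the cluster.
     */
    List<String> releasePendingPods() {
        List<String> released = new ArrayList<>();
        pods.entrySet().removeIf(entry -> {
            if (entry.getValue() == Phase.PENDING) {
                released.add(entry.getKey());
                return true;
            }
            return false;
        });
        return released;
    }

    /** Number of pods that are still counted as pending. */
    int pendingCount() {
        return (int) pods.values().stream().filter(p -> p == Phase.PENDING).count();
    }
}
```

In this sketch, calling releasePendingPods() on job cancellation leaves no 
pending entries behind, so a later job starts from a clean count. Handling 
deletion and termination through the same onPodGone() path also keeps the 
ledger consistent if a pending pod disappears for any other reason.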

I think it is reasonable for Flink to assume that no third party communicates 
with Kubernetes and manipulates its pods, unless that third party is 
absolutely necessary. That's why I asked about the reason for the manual pod 
deletions.

FYI, another ticket (FLINK-18229) is tracking the issue of cleaning up pending 
workers. Hopefully that solves your problem.

For both issues (FLINK-13554/18229), we are aiming to resolve them in the 
1.12 release. Unfortunately, both are currently blocked by other issues.

> K8s Resource Manager may lead to resource leak after pod deleted
> ----------------------------------------------------------------
>
>                 Key: FLINK-19171
>                 URL: https://issues.apache.org/jira/browse/FLINK-19171
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yi Tang
>            Priority: Minor
>
> {code:java}
> private void terminatedPodsInMainThread(List<KubernetesPod> pods) {
>    getMainThreadExecutor().execute(() -> {
>       for (KubernetesPod pod : pods) {
>          if (pod.isTerminated()) {
>             ...
>          }
>       }
>    });
> }
> {code}
> It looks like the RM only removes a pod from its ledger if the pod 
> "isTerminated", 
> while the pod is taken into account as soon as it is created.
> However, checking "isTerminated" alone is not sufficient, e.g. when a Pending 
> pod is deleted manually.
> After that, a new job that requires more resources cannot trigger the 
> allocation of a new pod.
>  
> Please let me know if I misunderstand, thanks.



