[ 
https://issues.apache.org/jira/browse/FLINK-19171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193432#comment-17193432
 ] 

Yi Tang edited comment on FLINK-19171 at 9/10/20, 7:19 AM:
-----------------------------------------------------------

Hi [~xintongsong],

Thanks for your reply.

It's not about why Flink should deal with these kinds of cases. It's about 
making things right.

Here is a real case of mine: I was trying Flink on K8s (a small cluster) and 
submitted a job that required more resources than the cluster had left. As a 
result, some new pods could not be scheduled and stayed in the Pending phase.

Then I canceled the job, but those Pending pods were not released because of 
the resource release strategy.

So I deleted them manually.

After that, I submitted a new job, and it could neither be assigned to a TM 
nor trigger resource allocation.

FLINK-13554, about a resource allocation timeout, is needed, I think, and 
checking resources correctly is also essential.
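
To make the scenario concrete, here is a minimal sketch of the bookkeeping 
problem as I understand it. It is not Flink's actual code: the class 
PodLedgerSketch, the PodEvent enum, and the releaseOnDelete flag are all 
hypothetical names used only for illustration. It assumes the RM counts a pod 
once it is created and decides how many new pods to request from that count.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of pod bookkeeping; not Flink's real classes. */
final class PodLedgerSketch {

    /** Events a pod watcher might report (illustrative names). */
    enum PodEvent { ADDED, TERMINATED, DELETED }

    /** Pods the resource manager believes it has requested and not yet lost. */
    private final Set<String> trackedPods = new HashSet<>();

    /** Assumption: toggles whether a deletion also removes the pod from the ledger. */
    private final boolean releaseOnDelete;

    PodLedgerSketch(boolean releaseOnDelete) {
        this.releaseOnDelete = releaseOnDelete;
    }

    void onEvent(String podName, PodEvent event) {
        switch (event) {
            case ADDED:
                // The pod is accounted for as soon as it is created.
                trackedPods.add(podName);
                break;
            case TERMINATED:
                trackedPods.remove(podName);
                break;
            case DELETED:
                // A Pending pod that is deleted never reports a terminated state,
                // so without this branch it stays in the ledger forever.
                if (releaseOnDelete) {
                    trackedPods.remove(podName);
                }
                break;
        }
    }

    /** Number of new pods to request so that tracked pods cover the requirement. */
    int podsToRequest(int requiredPods) {
        return Math.max(0, requiredPods - trackedPods.size());
    }

    public static void main(String[] args) {
        for (boolean releaseOnDelete : new boolean[] {false, true}) {
            PodLedgerSketch ledger = new PodLedgerSketch(releaseOnDelete);
            ledger.onEvent("tm-1", PodEvent.ADDED);   // created, stays Pending
            ledger.onEvent("tm-1", PodEvent.DELETED); // deleted manually
            System.out.println("releaseOnDelete=" + releaseOnDelete
                    + " -> podsToRequest=" + ledger.podsToRequest(1));
        }
    }
}
```

With releaseOnDelete=false the deleted pod is still counted, so a new job that 
needs one pod requests zero new pods. With releaseOnDelete=true the ledger is 
corrected and one pod is requested. Whether a real fix should react to the 
delete event or rely on a timeout is a separate design question.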


> K8s Resource Manager may lead to resource leak after pod deleted
> ----------------------------------------------------------------
>
>                 Key: FLINK-19171
>                 URL: https://issues.apache.org/jira/browse/FLINK-19171
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yi Tang
>            Priority: Minor
>
> {code:java}
> private void terminatedPodsInMainThread(List<KubernetesPod> pods) {
>    getMainThreadExecutor().execute(() -> {
>       for (KubernetesPod pod : pods) {
>          if (pod.isTerminated()) {
>             ...
>          }
>       }
>    });
> }
> {code}
> It looks like the RM only removes a pod from its ledger if the pod 
> "isTerminated", 
> while a pod is accounted for as soon as it is created.
> However, checking "isTerminated" alone is not sufficient, e.g. when a Pending 
> pod is deleted manually.
> After that, a new job that requires more resources cannot trigger the 
> allocation of a new pod.
>
> Please let me know if I have misunderstood, thanks.



