[
https://issues.apache.org/jira/browse/FLINK-19171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193432#comment-17193432
]
Yi Tang edited comment on FLINK-19171 at 9/10/20, 7:19 AM:
-----------------------------------------------------------
Hi [~xintongsong],
Thanks for your reply.
It's not about why Flink should deal with this kind of case; it's about making
things right.
A real case of mine: I tried Flink on a small K8s cluster and submitted a job
requiring more resources than the cluster had left. As a result, some new pods
could not be scheduled and stayed in the Pending phase.
I then canceled the job, but because of the resource release strategy those
Pending pods were not released, so I deleted them manually.
After that, I submitted a new job, and it could neither be assigned to a TM nor
trigger a new resource allocation.
I think the resource allocation timeout from FLINK-13554 is needed, and
checking resources correctly is also essential.
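To make the failure mode concrete, here is a minimal, self-contained sketch (not Flink code; {{PodLedger}} and its method names are hypothetical) of the accounting described above: pods are counted at creation and only removed from the ledger when they report a terminated phase, so a Pending pod deleted out-of-band is never released.

{code:java}
// Hypothetical model of the described accounting, NOT Flink's actual classes.
// A pod is tracked when created; it is only removed from the ledger when a
// TERMINATED phase is observed. A Pending pod that is deleted manually never
// delivers a TERMINATED phase through this path, so it leaks in the ledger.
import java.util.HashMap;
import java.util.Map;

public class PodLedger {
    enum Phase { PENDING, RUNNING, TERMINATED }

    private final Map<String, Phase> pods = new HashMap<>();

    void onPodCreated(String name) {
        // Pod is accounted for as soon as it is created.
        pods.put(name, Phase.PENDING);
    }

    // Mirrors the "isTerminated" check: only terminated pods are removed.
    void onPodEvent(String name, Phase phase) {
        if (phase == Phase.TERMINATED) {
            pods.remove(name);
        } else {
            pods.put(name, phase);
        }
    }

    int trackedPods() {
        return pods.size();
    }

    public static void main(String[] args) {
        PodLedger ledger = new PodLedger();
        ledger.onPodCreated("tm-1"); // pod stays Pending (no cluster capacity)
        // Operator deletes the Pending pod manually; no TERMINATED event
        // is ever observed for it, so the ledger still counts it.
        System.out.println(ledger.trackedPods()); // prints 1: resource leaked
    }
}
{code}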
> K8s Resource Manager may lead to resource leak after pod deleted
> ----------------------------------------------------------------
>
> Key: FLINK-19171
> URL: https://issues.apache.org/jira/browse/FLINK-19171
> Project: Flink
> Issue Type: Bug
> Reporter: Yi Tang
> Priority: Minor
>
> {code:java}
> private void terminatedPodsInMainThread(List<KubernetesPod> pods) {
>     getMainThreadExecutor().execute(() -> {
>         for (KubernetesPod pod : pods) {
>             if (pod.isTerminated()) {
>                 ...
>             }
>         }
>     });
> }
> {code}
> Looks like the RM only removes a pod from the ledger if the pod
> "isTerminated", and each pod is accounted for from the moment it is created.
> However, checking "isTerminated" alone is not complete: e.g. a Pending pod
> that is deleted manually never reaches a terminated state.
> After that, a new job requiring more resources cannot trigger the allocation
> of a new pod.
>
> Please let me know if I misunderstand, thanks.
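One possible direction, sketched below with hypothetical names ({{PodTracker}}, {{EventType}} are not Flink classes): treat a DELETED watch event as terminal regardless of the pod's phase, so a manually deleted Pending pod is also removed from the ledger.

{code:java}
// Sketch only, under the assumption that the RM observes K8s watch events.
// The idea: deletion is terminal even if the pod never left Pending.
import java.util.HashSet;
import java.util.Set;

public class PodTracker {
    enum EventType { ADDED, MODIFIED, DELETED }

    private final Set<String> tracked = new HashSet<>();

    void handle(String podName, EventType type, boolean terminated) {
        switch (type) {
            case ADDED:
                tracked.add(podName);
                break;
            case MODIFIED:
                if (terminated) {
                    tracked.remove(podName); // existing path: terminated pods
                }
                break;
            case DELETED:
                tracked.remove(podName);     // proposed: deletion is terminal too
                break;
        }
    }

    int size() {
        return tracked.size();
    }
}
{code}

With this, the Pending pods deleted in the scenario above would be released from the ledger and a subsequent job could trigger fresh allocations.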
--
This message was sent by Atlassian Jira
(v8.3.4#803005)