[jira] [Commented] (FLINK-19289) K8s resource manager terminated pod garbage collection

Xintong Song (Jira) Fri, 18 Sep 2020 02:52:37 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198250#comment-17198250
 ]


Xintong Song commented on FLINK-19289:
--------------------------------------

I double-checked the k8s watching logic. I think you are right: only an ADDED 
event is received for pods that have already terminated. This is indeed a 
valid bug, thanks for reporting it, [~yittg].

Regarding the solution, I would suggest checking the pod status in 
{{recoverWorkerNodesFromPreviousAttempts}} and removing pods that are already 
terminated, rather than handling this in {{onAdded}}. If I understand 
correctly, only pods recovered from a previous attempt have this problem. 
Removing the pods in {{recoverWorkerNodesFromPreviousAttempts}} might 
therefore be better, since that method only affects recovered pods, whereas 
{{onAdded}} affects both recovered and new pods.
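
To make the suggestion concrete, here is a minimal, self-contained sketch of 
the filtering step. It is not the actual Flink code: {{PodInfo}}, 
{{keepLivePods}} and the class name are hypothetical stand-ins, and I am 
assuming a pod counts as terminated when its Kubernetes phase is Succeeded or 
Failed. In the real resource manager the terminated pods would additionally 
have to be deleted through the Kubernetes client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecoveredPodFilter {

    /** Hypothetical stand-in for a recovered pod: its name and its Kubernetes phase. */
    static final class PodInfo {
        final String name;
        final String phase;

        PodInfo(String name, String phase) {
            this.name = name;
            this.phase = phase;
        }
    }

    /** Assumption: a pod is terminated when its phase is Succeeded or Failed. */
    static boolean isTerminated(PodInfo pod) {
        return "Succeeded".equals(pod.phase) || "Failed".equals(pod.phase);
    }

    /**
     * Returns the recovered pods that are still alive. The names of terminated
     * pods are appended to {@code toRemove} so the caller can delete them.
     */
    static List<PodInfo> keepLivePods(List<PodInfo> recovered, List<String> toRemove) {
        List<PodInfo> live = new ArrayList<>();
        for (PodInfo pod : recovered) {
            if (isTerminated(pod)) {
                toRemove.add(pod.name);
            } else {
                live.add(pod);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        List<PodInfo> recovered = Arrays.asList(
                new PodInfo("taskmanager-1", "Running"),
                new PodInfo("taskmanager-2", "Failed"),
                new PodInfo("taskmanager-3", "Pending"));

        List<String> toRemove = new ArrayList<>();
        List<PodInfo> live = keepLivePods(recovered, toRemove);

        System.out.println("live pods: " + live.size());
        System.out.println("pods to remove: " + toRemove);
    }
}
```

Doing the check only during recovery leaves the {{onAdded}} path for newly 
requested pods untouched.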

What do you think?

> K8s resource manager terminated pod garbage collection
> ------------------------------------------------------
>
>                 Key: FLINK-19289
>                 URL: https://issues.apache.org/jira/browse/FLINK-19289
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yi Tang
>            Priority: Minor
>
> Consider the following scenario:
> While the JM is down (no JM is running), a TM goes down with an error (caused 
> by the node or by the TM itself), leaving a pod in the Error state. After a 
> JM recovers, it receives an ADDED event for this pod and does nothing.
> I think we should handle this case properly in the `onAdded` callback.
> cc [~xintongsong].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
