[ 
https://issues.apache.org/jira/browse/FLINK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706356#comment-17706356
 ] 

Xintong Song commented on FLINK-31652:
--------------------------------------

[~xiasun], thanks for reporting this. I think this is a valid issue.

Would you like to open a pull request to fix this?

> Flink should handle the delete event if the pod was deleted while pending
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31652
>                 URL: https://issues.apache.org/jira/browse/FLINK-31652
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0, 1.16.1
>            Reporter: xingbe
>            Priority: Major
>
> I found that in a Kubernetes deployment, if the TaskManager pod is deleted 
> while in the 'Pending' phase, the Flink job gets stuck and keeps waiting for 
> the pod to be scheduled. The issue can be reproduced by running 
> 'kubectl delete pod' on the pod while it is in the Pending phase.
>  
> The root cause is that the pod status is not updated in time in this 
> case, so the KubernetesResourceManagerDriver does not detect that the pod 
> has terminated. I verified this by logging the pod status in 
> KubernetesPod#isTerminated(), as shown below.
> {code:java}
> public boolean isTerminated() {
>     log.info("pod status: " + getInternalResource().getStatus());
>     if (getInternalResource().getStatus() != null) {
>         final boolean podFailed =
>                 PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
>         final boolean containersFailed =
>                 getInternalResource().getStatus().getContainerStatuses().stream()
>                         .anyMatch(
>                                 e ->
>                                         e.getState() != null
>                                                 && e.getState().getTerminated() != null);
>         return containersFailed || podFailed;
>     }
>     return false;
> }
> {code}
> In this case, the function returns false because `containersFailed` and 
> `podFailed` are both false: the pod phase is still Pending and 
> containerStatuses is empty, as the logged status shows.
> {code:java}
> PodStatus(conditions=[PodCondition(lastProbeTime=null, 
> lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False, 
> type=PodScheduled, additionalProperties={})], containerStatuses=[], 
> ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[], 
> message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[], 
> qosClass=Guaranteed, reason=null, startTime=null, 
> additionalProperties={}){code}
>  
>  
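
To make the reasoning in the description concrete, here is a minimal, self-contained sketch. It is not Flink code: PodStatus, ContainerStatus, EventType and shouldReleasePod are hypothetical stand-ins for illustration only, and treating a DELETED event as terminal is just one possible approach, not a confirmed fix. It reproduces the quoted check against a status like the logged one (phase=Pending, containerStatuses=[]).

```java
import java.util.Collections;
import java.util.List;

public class PodTerminationSketch {

    // Hypothetical stand-in for a container status; only the "terminated" flag is modeled.
    public static final class ContainerStatus {
        final boolean terminated;

        public ContainerStatus(boolean terminated) {
            this.terminated = terminated;
        }
    }

    // Hypothetical stand-in for the pod status, reduced to the two fields the check reads.
    public static final class PodStatus {
        final String phase;
        final List<ContainerStatus> containerStatuses;

        public PodStatus(String phase, List<ContainerStatus> containerStatuses) {
            this.phase = phase;
            this.containerStatuses = containerStatuses;
        }
    }

    // Hypothetical watch event kinds, used only for this illustration.
    public enum EventType { ADDED, MODIFIED, DELETED }

    // Same logic as the quoted isTerminated(): the pod phase is Failed,
    // or at least one container reports a terminated state.
    public static boolean isTerminated(PodStatus status) {
        if (status == null) {
            return false;
        }
        final boolean podFailed = "Failed".equals(status.phase);
        final boolean containersFailed =
                status.containerStatuses.stream().anyMatch(c -> c.terminated);
        return containersFailed || podFailed;
    }

    // Assumption: one way to avoid the hang is to treat a DELETED event as
    // terminal regardless of what the (possibly stale) status says.
    public static boolean shouldReleasePod(EventType type, PodStatus status) {
        return type == EventType.DELETED || isTerminated(status);
    }

    public static void main(String[] args) {
        // Status matching the log in the description: Pending, no container statuses.
        PodStatus pending = new PodStatus("Pending", Collections.emptyList());

        System.out.println(isTerminated(pending));                        // prints false
        System.out.println(shouldReleasePod(EventType.DELETED, pending)); // prints true
    }
}
```

With the status-only check, a pod deleted while Pending never looks terminated, which matches the behavior reported above. Taking the event type into account removes the dependence on the status being up to date.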



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
