[ https://issues.apache.org/jira/browse/FLINK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xintong Song reassigned FLINK-31652:
------------------------------------

    Assignee: xingbe

> Flink should handle the delete event if the pod was deleted while pending
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31652
>                 URL: https://issues.apache.org/jira/browse/FLINK-31652
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0, 1.16.1
>            Reporter: xingbe
>            Assignee: xingbe
>            Priority: Major
>
> I found that in a Kubernetes deployment, if the TaskManager pod is deleted
> while in the 'Pending' phase, the Flink job gets stuck and keeps waiting for
> the pod to be scheduled. The issue can be reproduced by running
> 'kubectl delete pod' on the pod while it is in the pending phase.
>
> The root cause is that the pod status is not updated in time in this case,
> so the KubernetesResourceManagerDriver does not detect that the pod has
> terminated. I verified this by logging the pod status in
> KubernetesPod#isTerminated(), which shows the following.
> {code:java}
> public boolean isTerminated() {
>     // Debug logging added to inspect the status seen by the driver.
>     log.info("pod status: " + getInternalResource().getStatus());
>     if (getInternalResource().getStatus() != null) {
>         final boolean podFailed =
>                 PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
>         final boolean containersFailed =
>                 getInternalResource().getStatus().getContainerStatuses().stream()
>                         .anyMatch(
>                                 e ->
>                                         e.getState() != null
>                                                 && e.getState().getTerminated() != null);
>         return containersFailed || podFailed;
>     }
>     return false;
> } {code}
> In this case, the function returns false because `containersFailed` and
> `podFailed` are both false.
> {code:java}
> PodStatus(conditions=[PodCondition(lastProbeTime=null,
> lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False,
> type=PodScheduled, additionalProperties={})], containerStatuses=[],
> ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[],
> message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[],
> qosClass=Guaranteed, reason=null, startTime=null,
> additionalProperties={}){code}
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
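
The reasoning in the report can be checked with a small standalone sketch. It does not use the fabric8 or Flink types: `PendingPodCheck`, `PodStatus`, and `ContainerStatus` below are hypothetical stand-ins that keep only the two fields the quoted check reads (`phase` and `containerStatuses`), and `isTerminated` is a simplified copy of the quoted logic, not the Flink implementation.

```java
import java.util.List;

public class PendingPodCheck {

    // Hypothetical stand-in for a container status: only records whether
    // a "terminated" state is present.
    public static final class ContainerStatus {
        final boolean terminated;

        public ContainerStatus(boolean terminated) {
            this.terminated = terminated;
        }
    }

    // Hypothetical stand-in for a pod status: phase plus container statuses.
    public static final class PodStatus {
        final String phase;
        final List<ContainerStatus> containerStatuses;

        public PodStatus(String phase, List<ContainerStatus> containerStatuses) {
            this.phase = phase;
            this.containerStatuses = containerStatuses;
        }
    }

    // Simplified copy of the check quoted in the report: a pod counts as
    // terminated only if its phase is "Failed" or some container has a
    // terminated state.
    public static boolean isTerminated(PodStatus status) {
        if (status != null) {
            boolean podFailed = "Failed".equals(status.phase);
            boolean containersFailed =
                    status.containerStatuses.stream().anyMatch(c -> c.terminated);
            return containersFailed || podFailed;
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirrors the logged status: phase=Pending, containerStatuses=[].
        PodStatus deletedWhilePending = new PodStatus("Pending", List.of());
        System.out.println("terminated? " + isTerminated(deletedWhilePending));
    }
}
```

With the logged values (`phase=Pending`, empty `containerStatuses`), neither condition can be true, so a check based on the status alone cannot observe the deletion. This is consistent with the issue title: the signal has to come from the delete event itself.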