xingbe created FLINK-31652:
------------------------------
Summary: Flink should handle the delete event if the pod was
deleted while pending
Key: FLINK-31652
URL: https://issues.apache.org/jira/browse/FLINK-31652
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes
Affects Versions: 1.16.1, 1.17.0
Reporter: xingbe
I found that in a Kubernetes deployment, if a TaskManager pod is deleted while in the
'Pending' phase, the Flink job gets stuck waiting for the pod to be scheduled. The
issue can be reproduced by deleting the pod with the 'kubectl delete pod' command
while it is still in the pending phase, as sketched below.
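A rough reproduction sketch (namespace and pod name are placeholders; any TaskManager pod that is still unschedulable works):
{code:bash}
# List the TaskManager pods and pick one stuck in the Pending phase
kubectl get pods -n <namespace>

# Delete it while it is still Pending; the Flink job then waits indefinitely
kubectl delete pod <taskmanager-pod-name> -n <namespace>
{code}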
The root cause is that the pod status is not updated in time in this case, so the
KubernetesResourceManagerDriver never detects that the pod has terminated. I verified
this by logging the pod status in KubernetesPod#isTerminated():
{code:java}
public boolean isTerminated() {
    // Added for debugging: log the raw pod status on every check.
    log.info("pod status: " + getInternalResource().getStatus());
    if (getInternalResource().getStatus() != null) {
        // Terminated if the pod phase is Failed ...
        final boolean podFailed =
                PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
        // ... or if any container has reached a terminated state.
        final boolean containersFailed =
                getInternalResource().getStatus().getContainerStatuses().stream()
                        .anyMatch(e -> e.getState() != null && e.getState().getTerminated() != null);
        return containersFailed || podFailed;
    }
    return false;
} {code}
In this case the function returns false, because both {{containersFailed}} and
{{podFailed}} are false. The logged pod status shows the pod still in the 'Pending'
phase with no container statuses:
{code:java}
PodStatus(conditions=[PodCondition(lastProbeTime=null,
lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False,
type=PodScheduled, additionalProperties={})], containerStatuses=[],
ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[],
message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[],
qosClass=Guaranteed, reason=null, startTime=null, additionalProperties={}){code}
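One possible direction is to treat the watch's DELETED event itself as the terminal signal instead of relying on the reported pod status. Below is a minimal standalone sketch against the fabric8 Kubernetes client (which Flink's Kubernetes integration builds on); the handleTerminatedPod helper is hypothetical, and this is not the actual Flink fix:
{code:java}
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class PodDeleteWatchSketch {

    public static void watchPods(KubernetesClient client, String namespace) {
        client.pods()
                .inNamespace(namespace)
                .watch(
                        new Watcher<Pod>() {
                            @Override
                            public void eventReceived(Action action, Pod pod) {
                                // A pod deleted while Pending still reports phase=Pending with
                                // empty containerStatuses, so a status-based check like
                                // isTerminated() misses it; the DELETED action itself is the
                                // reliable signal that the pod is gone.
                                if (action == Action.DELETED) {
                                    handleTerminatedPod(pod); // hypothetical callback
                                }
                            }

                            @Override
                            public void onClose(WatcherException cause) {
                                // In production code the watch should be re-established here.
                            }
                        });
    }

    private static void handleTerminatedPod(Pod pod) {
        System.out.println(
                "treating deleted pod as terminated: " + pod.getMetadata().getName());
    }
}
{code}
In KubernetesResourceManagerDriver, this would presumably mean handling the delete event from the pod watch as a pod termination even when KubernetesPod#isTerminated() still returns false.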