[
https://issues.apache.org/jira/browse/FLINK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706247#comment-17706247
]
xingbe commented on FLINK-31652:
--------------------------------
Hi [~xtsong], could you please take a look at this ticket when you have time?
Thanks!
> Flink should handle the delete event if the pod was deleted while pending
> -------------------------------------------------------------------------
>
> Key: FLINK-31652
> URL: https://issues.apache.org/jira/browse/FLINK-31652
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0, 1.16.1
> Reporter: xingbe
> Priority: Major
>
> I found that in a Kubernetes deployment, if a TaskManager pod is deleted while
> in the 'Pending' phase, the Flink job gets stuck and keeps waiting for the pod
> to be scheduled. The issue can be reproduced by running 'kubectl delete pod'
> on the pod while it is in the Pending phase.
>
> The root cause is that the pod status is not updated in time in this case, so
> KubernetesResourceManagerDriver does not detect that the pod has been
> terminated. I verified this by logging the pod status in
> KubernetesPod#isTerminated(), as shown below.
> {code:java}
> public boolean isTerminated() {
>     log.info("pod status: " + getInternalResource().getStatus());
>     if (getInternalResource().getStatus() != null) {
>         final boolean podFailed =
>                 PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
>         final boolean containersFailed =
>                 getInternalResource().getStatus().getContainerStatuses().stream()
>                         .anyMatch(
>                                 e ->
>                                         e.getState() != null
>                                                 && e.getState().getTerminated() != null);
>         return containersFailed || podFailed;
>     }
>     return false;
> }
> {code}
> In this case the function returns false because both `containersFailed` and
> `podFailed` are false: the logged status below has phase=Pending and an empty
> containerStatuses list.
> {code:java}
> PodStatus(conditions=[PodCondition(lastProbeTime=null,
> lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False,
> type=PodScheduled, additionalProperties={})], containerStatuses=[],
> ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[],
> message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[],
> qosClass=Guaranteed, reason=null, startTime=null,
> additionalProperties={}){code}
>
>
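The behaviour described in the quoted report can be illustrated with a small, self-contained sketch. It does not use the real Flink or fabric8 classes: `PendingPodSketch`, `Status` and `Event` are hypothetical stand-ins. The one-argument `isTerminated` mirrors the status-only check from the report. The two-argument overload is only an assumed way a delete event might be taken into account, not the actual fix.

```java
import java.util.List;

/** Simplified stand-ins for the pod model; NOT the fabric8 or Flink classes. */
public class PendingPodSketch {

    /** Hypothetical minimal pod status: a phase plus one "terminated" flag per container. */
    public static final class Status {
        final String phase;
        final List<Boolean> containerTerminated;

        public Status(String phase, List<Boolean> containerTerminated) {
            this.phase = phase;
            this.containerTerminated = containerTerminated;
        }
    }

    /** Hypothetical watch event kinds. */
    public enum Event { ADDED, MODIFIED, DELETED }

    /** Mirrors the status-only check quoted in the report. */
    public static boolean isTerminated(Status status) {
        if (status == null) {
            return false;
        }
        final boolean podFailed = "Failed".equals(status.phase);
        final boolean containersFailed =
                status.containerTerminated.stream().anyMatch(Boolean::booleanValue);
        return containersFailed || podFailed;
    }

    /** Assumption: a DELETED event counts as termination regardless of the last seen status. */
    public static boolean isTerminated(Event event, Status status) {
        return event == Event.DELETED || isTerminated(status);
    }

    public static void main(String[] args) {
        // Same shape as the logged status: phase=Pending, containerStatuses=[].
        Status pending = new Status("Pending", List.of());
        System.out.println(isTerminated(pending));                // false
        System.out.println(isTerminated(Event.DELETED, pending)); // true
    }
}
```

With a status-only check, a pod deleted while Pending never looks terminated, which matches the stuck job described above. Taking the event type into account is one possible direction, under the assumptions stated.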
--
This message was sent by Atlassian Jira
(v8.20.10#820010)