[ https://issues.apache.org/jira/browse/FLINK-33096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767595#comment-17767595 ]

wawa commented on FLINK-33096:
------------------------------

Sorry, let me correct my previous description:

An exception occurred while scheduling the TaskManager, and the restart 
strategy took effect. As a result, the Duration value (the failure-rate 
interval) was reset and started counting again.
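For reference, this is how a failure-rate restart strategy is set up in 
flink-conf.yaml; the values below are a minimal sketch for illustration, 
not the settings from my job:

restart-strategy: failure-rate
# Fail the job only if more than 3 failures occur within one 5-minute interval
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
# Delay between consecutive restart attempts
restart-strategy.failure-rate.delay: 10 s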

However, because the pod exceeded its memory limit, the TaskManager pod was 
killed and evicted by Kubernetes. Scheduling a replacement TaskManager pod 
then failed, causing the entire Flink job to fail.
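If the eviction is driven by memory overuse, one possible mitigation (a 
sketch with illustrative values, assuming the overuse comes from off-heap or 
native memory rather than heap) is to raise the process size and the JVM 
overhead headroom in flink-conf.yaml:

# Total memory requested per TaskManager pod; in native Kubernetes mode the
# container memory request/limit is derived from this value
taskmanager.memory.process.size: 4096m
# Reserve a larger fraction for JVM overhead (native/off-heap); default is 0.1
taskmanager.memory.jvm-overhead.fraction: 0.2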

> Flink on k8s: if one TaskManager pod crashes, the whole Flink job fails
> ------------------------------------------------------------------------
>
>                 Key: FLINK-33096
>                 URL: https://issues.apache.org/jira/browse/FLINK-33096
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.14.3
>            Reporter: wawa
>            Priority: Major
>
> The Flink version is 1.14.3, and the job is submitted to Kubernetes in 
> Native Kubernetes application mode. When a TaskManager pod crashes due to 
> an exception, Kubernetes attempts to start a new TaskManager pod, but the 
> scheduling is halted immediately and the entire Flink job is terminated. 
> By contrast, when the JobManager pod crashes, Kubernetes successfully 
> schedules a new JobManager pod. This was observed during application 
> usage. Could you please help analyze the underlying issue?


