[ https://issues.apache.org/jira/browse/FLINK-33096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767595#comment-17767595 ]
wawa commented on FLINK-33096:
------------------------------

So sorry, correcting the previous description: an exception occurred while scheduling the TaskManager, and the restart strategy took effect; as a result, the Duration value was reset and counting started again. However, because the pod overused its memory, the TaskManager pod was killed and evicted by Kubernetes. The attempt to schedule a new TaskManager pod then failed, and the entire Flink job failed.

> Flink on k8s: if one TaskManager pod crashes, the whole Flink job fails
> ------------------------------------------------------------------------
>
>                 Key: FLINK-33096
>                 URL: https://issues.apache.org/jira/browse/FLINK-33096
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.14.3
>            Reporter: wawa
>            Priority: Major
>
> The Flink version is 1.14.3, and the job is submitted to Kubernetes using the
> Native Kubernetes application mode. When a TaskManager pod crashes due to an
> exception during scheduling, Kubernetes attempts to start a new TaskManager
> pod, but the scheduling process is halted immediately and the entire Flink
> job is terminated. By contrast, if the JobManager pod crashes, Kubernetes
> successfully schedules a new JobManager pod. This was observed during
> application usage. Can you please help analyze the underlying issue?
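For context, the behavior described in the comment ("the Duration value was reset and started counting again") matches Flink's failure-rate restart strategy, which only fails the job when a failure threshold is exceeded within a measurement interval. Below is a minimal sketch of setting it programmatically on Flink 1.14; the threshold values and the pipeline body are hypothetical, not taken from the reported job:

{code:java}
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailureRateRestartExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Failure-rate restart strategy: the job fails permanently only when
        // more than 3 task failures occur within any 5-minute interval;
        // failures older than the interval no longer count, so after a quiet
        // period the counter effectively starts over.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3,                  // hypothetical: max failures per interval
                Time.minutes(5),    // hypothetical: measurement interval
                Time.seconds(10))); // hypothetical: delay between restart attempts

        // Placeholder pipeline body, only to make the sketch runnable.
        env.fromElements(1, 2, 3).print();

        env.execute("failure-rate-restart-example");
    }
}
{code}

The same strategy can also be set cluster-wide in flink-conf.yaml via restart-strategy: failure-rate and the restart-strategy.failure-rate.* options, which is likely where the interval ("Duration") the comment mentions is configured.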