[ https://issues.apache.org/jira/browse/FLINK-33998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806664#comment-17806664 ]
Matthias Pohl commented on FLINK-33998:
---------------------------------------

I couldn't find anything that sounds related to your issue in the release notes of [Flink 1.14.0|https://nightlies.apache.org/flink/flink-docs-release-1.18/release-notes/flink-1.14/#runtime--coordination]. A more detailed overview is possible by browsing through all the changes of the [individual 1.14.x releases|https://issues.apache.org/jira/projects/FLINK?selectedItem=com.atlassian.jira.jira-projects-plugin:release-page&status=released&contains=1.14], but that's quite tedious.

> Flink Job Manager restarted after kube-apiserver connection intermittent
> ------------------------------------------------------------------------
>
>                 Key: FLINK-33998
>                 URL: https://issues.apache.org/jira/browse/FLINK-33998
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.6
>         Environment: Kubernetes 1.24
> Flink Operator 1.4
> Flink 1.13.6
>            Reporter: Xiangyan
>            Priority: Major
>         Attachments: audit-log-no-restart.txt, audit-log-restart.txt, connection timeout.png, jm-no-restart4.log, jm-restart4.log
>
>
> We are running Flink on AWS EKS and experienced a Job Manager restart issue when the EKS control plane scaled up/in.
> I can reproduce this issue in my local environment too.
> Since I have no control over the EKS kube-apiserver, I built a Kubernetes cluster of my own with the setup below:
> * Two kube-apiserver instances, with only one running at a time;
> * Deploy multiple Flink clusters (with Flink Operator 1.4 and Flink 1.13);
> * Enable Flink Job Manager HA;
> * Configure the Job Manager leader election timeouts:
> {code:java}
> high-availability.kubernetes.leader-election.lease-duration: "60s"
> high-availability.kubernetes.leader-election.renew-deadline: "60s"{code}
> For testing, I switch the running kube-apiserver from one instance to the other each time. While the kube-apiserver is switching over, I can see that some Job Managers restart, but some keep running normally.
> Here is an example. When the kube-apiserver switched over at 05:{color:#ff0000}{{*53*}}{color}:08, both JMs lost their connection to the kube-apiserver, but there were no more connection errors after a few seconds; I guess the connections recovered through retries.
> However, one of the JMs (the 2nd one in the attached screenshot) reported a "DefaultDispatcherRunner was revoked the leadership" error once the leader election timeout expired (at 05:{color:#ff0000}{{*54*}}{color}:08) and then restarted itself, while the other JM was still running normally.
> From the kube-apiserver audit logs, the normal JM was able to renew its leader lease after the interruption, but there was no lease renewal request from the failed JM until it restarted.
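For reference, the quoted description only lists the two overridden lease timeouts. A minimal sketch of the surrounding Kubernetes HA configuration such a setup would typically rely on is shown below; the cluster id and storage path are illustrative placeholders, not values taken from this report, and deployments via the Flink Kubernetes Operator may set some of these keys for you.

{code:java}
# flink-conf.yaml sketch (assumed setup, not from the report)
kubernetes.cluster-id: my-flink-cluster                  # hypothetical cluster id
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/flink-ha    # hypothetical HA storage path

# Leader election settings quoted in the description:
high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"
# The retry period is assumed to be left at its default here.
{code}

With these settings, a JobManager that stops renewing its leader lease is revoked leadership once the lease expires, which matches the report's observation that the failed JM issued no lease renewal requests before the "DefaultDispatcherRunner was revoked the leadership" error and subsequent restart.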