[
https://issues.apache.org/jira/browse/FLINK-33998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangyan updated FLINK-33998:
-----------------------------
Description:
We are running Flink on AWS EKS and experienced Job Manager restarts when the
EKS control plane scaled up or in.
I can reproduce this issue in my local environment as well.
Since I have no control over the EKS kube-apiserver, I built a Kubernetes cluster
of my own with the following setup:
* Two kube-apiserver instances, with only one running at a time;
* Multiple Flink clusters deployed (Flink Operator 1.4 and Flink 1.13);
* Flink Job Manager HA enabled;
* Job Manager leader election timeouts configured as below (a fuller sketch of the surrounding HA settings follows the snippet);
{code:java}
high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"{code}
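For reference, these two options are part of Flink 1.13's Kubernetes HA configuration. The sketch below shows the surrounding HA settings roughly as they would look in flink-conf.yaml; the cluster-id and storageDir values are placeholders rather than my exact setup, and with the Flink Operator most of these entries go under the FlinkDeployment's spec.flinkConfiguration.
{code:java}
# Placeholder cluster id and HA storage location.
kubernetes.cluster-id: my-flink-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/flink-ha
# The two timeouts under test, raised from their 15s defaults.
high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"
# Retry period left at its default.
high-availability.kubernetes.leader-election.retry-period: "5s"{code}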
For testing, I switch the running kube-apiserver from one instance to the other.
During the switchover, I can see that some Job Managers restart while others keep
running normally.
Here is an example. When the kube-apiserver switched over at
05:{color:#ff0000}{{*53*}}{color}:08, both JMs lost their connection to the
kube-apiserver, but there were no more connection errors after a few seconds, so
I assume the connections recovered via retries.
However, one of the JMs (the second one in the attached screenshot) reported a
"{{{}DefaultDispatcherRunner was revoked the leadership{}}}" error after the
leader election timeout (at 05:{color:#ff0000}{{*54*}}{color}:08) and then
restarted itself, while the other JM kept running normally.
From the kube-apiserver audit logs, the healthy JM was able to renew its leader
lease after the interruption, but there was no lease renew request from the
failed JM until it restarted.
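For context on what the revocation means: as far as I understand, Flink 1.13's Kubernetes HA builds on the fabric8 kubernetes-client leader election, where the lease is kept as an annotation on a ConfigMap, and a leader that cannot renew it before the renew deadline gets its "stop leading" callback invoked, which is what surfaces as the "revoked the leadership" log. Below is a minimal standalone sketch of that mechanism, not Flink's actual code; the namespace, ConfigMap name, and identity are made up, and the fabric8 entry point may differ between client versions.
{code:java}
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.NamespacedKubernetesClient;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock;

import java.time.Duration;

public class LeaderElectionSketch {
  public static void main(String[] args) {
    try (NamespacedKubernetesClient client = new DefaultKubernetesClient()) {
      client.leaderElector()
          .withConfig(new LeaderElectionConfigBuilder()
              // Same semantics as the Flink options above.
              .withLeaseDuration(Duration.ofSeconds(60))
              .withRenewDeadline(Duration.ofSeconds(60))
              .withRetryPeriod(Duration.ofSeconds(5))
              // The lease lives as an annotation on this ConfigMap; names are placeholders.
              .withLock(new ConfigMapLock("flink-ns", "example-dispatcher-leader", "jm-pod-1"))
              .withLeaderCallbacks(new LeaderCallbacks(
                  () -> System.out.println("granted leadership"),
                  // Fired when the renew deadline passes without a successful renew;
                  // in Flink this is the path that ends in the JM restarting itself.
                  () -> System.out.println("revoked leadership"),
                  newLeader -> System.out.println("new leader: " + newLeader)))
              .build())
          .build()
          .run(); // blocks; attempts to renew the lease every retry period
    }
  }
}{code}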
> Flink Job Manager restarted after intermittent kube-apiserver connection
> ------------------------------------------------------------------------
>
> Key: FLINK-33998
> URL: https://issues.apache.org/jira/browse/FLINK-33998
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.6
> Environment: Kubernetes 1.24
> Flink Operator 1.4
> Flink 1.13.6
> Reporter: Xiangyan
> Priority: Major
> Attachments: audit-log-no-restart.txt, audit-log-restart.txt,
> connection timeout.png, jm-no-restart4.log, jm-restart4.log
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)