[jira] [Created] (FLINK-33998) Flink Job Manager restarted after kube-apiserver connection intermittent

Xiangyan (Jira) Thu, 04 Jan 2024 16:13:04 -0800

Xiangyan created FLINK-33998:
--------------------------------

             Summary: Flink Job Manager restarted after kube-apiserver 
connection intermittent
                 Key: FLINK-33998
                 URL: https://issues.apache.org/jira/browse/FLINK-33998
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.13.6
         Environment: Kubernetes 1.24, Flink Operator 1.4, Flink 1.13.6
            Reporter: Xiangyan
         Attachments: audit-log-no-restart.txt, audit-log-restart.txt, 
connection timeout.png, jm-no-restart4.log, jm-restart4.log

We are running Flink on AWS EKS and experienced Job Manager restarts when the 
EKS control plane scaled up or in.

I can reproduce this issue in my local environment too.

Since I have no control over the EKS kube-apiserver, I built my own Kubernetes 
cluster with the following setup:
 * Two kube-apiserver instances, only one of which is running at a time;
 * Multiple Flink clusters deployed (with Flink Operator 1.4 and Flink 1.13);
 * Flink Job Manager HA enabled;
 * Job Manager leader election timeouts configured as follows:

high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"
 
For each test, I switch the running kube-apiserver from one instance to the 
other. During the switchover, some Job Managers restart while others keep 
running normally.

Here is an example. When the kube-apiserver switched over at 05:53:08, both 
JMs lost their connection to the kube-apiserver. The connection errors stopped 
within a few seconds, so I assume the connection recovered on retry.

However, one of the JMs (the 2nd one in the attached screenshot) reported a 
"leadership revoked" error after the leader election timeout (at 05:54:08) 
and then restarted itself, while the other JM kept running normally.
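
To make the timing explicit, here is a minimal sketch of the arithmetic, 
assuming a simplified model in which the last successful lease renewal 
happened right around the switchover and leadership is given up once the 
60s renew deadline passes without another successful renewal. The function 
name is hypothetical, and the real leader elector in the Kubernetes client 
may behave differently in detail.

```python
from datetime import datetime, timedelta


def revocation_time(last_renewal: str, renew_deadline_s: int) -> str:
    """Return the HH:MM:SS at which leadership would be given up if no
    renewal succeeds after `last_renewal` (simplified model, an assumption)."""
    t = datetime.strptime(last_renewal, "%H:%M:%S")
    return (t + timedelta(seconds=renew_deadline_s)).strftime("%H:%M:%S")


# Switchover at 05:53:08 with renew-deadline "60s", as configured above.
print(revocation_time("05:53:08", 60))  # prints 05:54:08
```

This matches the observed "leadership revoked" time, which suggests the 
failed JM did not complete a single renewal during the whole 60s window.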

From the kube-apiserver audit logs, the normal JM was able to renew its 
leader lease after the interruption. But there was no lease renewal request 
at all from the failed JM until it restarted.
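
For anyone who wants to repeat this check, below is a rough sketch of how 
the renewal requests can be counted per lock object. It assumes the audit 
log is in the standard JSON-lines Kubernetes audit event format, that 
renewals show up as update/patch requests on ConfigMaps (or Leases), and 
that all timestamps are UTC in the same RFC 3339 layout so they can be 
compared as strings. The function name and the choice of resources are my 
assumptions and may need adjusting for the attached files.

```python
import json
from collections import Counter

RENEW_VERBS = {"update", "patch"}          # assumed verbs for a lease renewal
LOCK_RESOURCES = {"configmaps", "leases"}  # assumed leader-election lock types


def count_lease_renewals(lines, since):
    """Count completed update/patch requests per lock object whose
    requestReceivedTimestamp is at or after `since`."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON audit events
        if not isinstance(event, dict):
            continue
        # Count each request once, at its final stage.
        if event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") not in LOCK_RESOURCES:
            continue
        if event.get("verb") not in RENEW_VERBS:
            continue
        if event.get("requestReceivedTimestamp", "") < since:
            continue
        counts[ref.get("name", "?")] += 1
    return counts
```

Usage would be along the lines of 
count_lease_renewals(open("audit-log-restart.txt"), "<switchover timestamp>"); 
a JM whose lock object is missing from the result sent no renewal after 
that point.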

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
