[
https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823663#comment-17823663
]
chenyuzhi commented on FLINK-34576:
-----------------------------------
Thanks for the reply.
1. Is there a way to somehow repro this on a smaller case?
I tried to simulate a leader switch by deleting the operator pod in the test
environment, but could not reproduce the problem. In the production
environment it is very likely to occur (maybe it is related to the load?).
There may be another way to make the operator pod lose leadership for a repro,
but so far I have not found any way other than deleting the pod.
2. Have you tried operator version 1.7.0? We may have fixed the issue there
already
We have not upgraded to 1.7.0 because that version no longer supports
Flink 1.14.0, which our production environment is still using.
Are you referring to this [JOSDK
issue|https://github.com/operator-framework/java-operator-sdk/issues/2056]? We
did encounter a split-brain problem with multiple leaders earlier, but as
mentioned in the first question, this status exception still occurs after
the leader has been switched successfully (confirmed in the logs: the old
leader exits and the new leader takes over).
3. Does it also affect newer Flink versions as well?
Our highest Flink version is 1.15.2, so the impact on higher versions is
unknown.
4. Can you share some relevant operator logs?
Sure.
operatorA log at the time of the leader switch (the "Stopped leading" line
appears here), taken from the log file:
{code:java}
2024-03-05 04:35:46,565 o.a.f.c.Configuration [WARN
][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
2024-03-05 04:35:46,567 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
2024-03-05 04:35:46,569 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-erie-erie-gzailab-sym2-ns-imageveri] Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
2024-03-05 04:35:46,569 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/test-vk-log3] Config uses deprecated configuration key
'high-availability' instead of proper key 'high-availability.type'
2024-03-05 04:35:46,574 i.j.o.LeaderElectionManager [INFO ] New leader with
identity:
2024-03-05 04:35:46,584 o.a.f.c.Configuration [WARN
][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Config uses
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,584 o.a.f.c.Configuration [WARN
][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Config uses
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,586 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO
][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Resource fully
reconciled, nothing to do...
2024-03-05 04:35:46,586 i.j.o.LeaderElectionManager [INFO ] Stopped leading
for identity: flink-kubernetes-operator-85f6994468-cpsx9. Exiting.
2024-03-05 04:35:46,589 o.a.f.k.o.l.AuditUtils [INFO
][gdc-gdc-bu/test-lag-202306-v2-copy-cpu] >>> Status | Error | STABLE
|
{"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA
metadata not available to restore from last state. It is possible that the job
has finished or terminally failed, or the configmaps have been deleted. Manual
restore required.","additionalMetadata":{},"throwableList":[]}
2024-03-05 04:35:46,591 o.a.f.c.Configuration [WARN
][gdc-a29-bu/logdistribution-xia-xia-a29-pc-vm-log-product] Config uses
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,591 o.a.f.c.Configuration [WARN
][gdc-a29-bu/logdistribution-xia-xia-a29-pc-vm-log-product] Config uses
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-grand-grand-s8-serverlog-production] Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-jinghang-jinghang-g106-seazyi-nginx] Config uses
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-jinghang-jinghang-g106-seazyi-nginx] Config uses
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-jinghang-jinghang-artct-outer-p4-se] Config uses
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-jinghang-jinghang-artct-outer-p4-se] Config uses
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN
][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration [WARN
][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration [WARN
][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key
'kubernetes.taskmanager.cpu.amount' {code}
operatorB log at the time of the switch, taken from ES (the format is a little
different from the log file above):
{code:java}
-- Meters ---------------------------------------------------------------------
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.NumPerSecond:
0.35
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpRequest.NumPerSecond:
0.35
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.201.NumPerSecond:
0.0
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.200.NumPerSecond:
0.3333333333333333
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.101.NumPerSecond:
0.016666666666666666
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpRequest.Failed.NumPerSecond:
0.0
-- Histograms ---------------------------------------------------------------------
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.TimeNanos:
count=1000, min=939944, max=49957558, mean=1717875.3819999998,
stddev=2272368.0974273267, p50=1293964.5, p75=1475059.25,
p95=3561530.249999989, p98=6726813.320000002, p99=8400899.390000004,
p999=4.9932472127003446E7
=========================== Finished metrics report ==========================="
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,027 INFO
io.javaoperatorsdk.operator.LeaderElectionManager - New leader with
identity:
"
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,121 INFO
io.javaoperatorsdk.operator.LeaderElectionManager - New leader with
identity: flink-kubernetes-operator-85f6994468-92xsz
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,126 INFO
io.javaoperatorsdk.operator.processing.Controller - Started event
processing for controller: flinksessionjobcontroller
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN
org.apache.flink.configuration.Configuration
[gdc-gdc-sa/logstream-wei-ma65-production] - Config uses deprecated
configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN
org.apache.flink.configuration.Configuration
[gdc-nsh-bu/logdistribution-kiel-kiel-nsh-lhall-eos-produ] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,905 WARN
org.apache.flink.configuration.Configuration
[gdc-nsh-bu/logdistribution-kiel-kiel-nsh-lhall-eos-produ] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,905 WARN
org.apache.flink.configuration.Configuration
[gdc-a29-bu/logdistribution-tang-tang-a29-zycenter-hub-pr] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN
org.apache.flink.configuration.Configuration
[gdc-g117-bu/logdistribution-welland-welland-g117-serverlo] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,901 WARN
org.apache.flink.configuration.Configuration
[gdc-gdc-sa/logstream-jinghang-jinghang-opd-java-fs-log-p] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,913 WARN
org.apache.flink.configuration.Configuration
[gdc-g117-bu/logdistribution-welland-welland-g117-serverlo] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,921 WARN
org.apache.flink.configuration.Configuration
[gdc-qdata-bu/prod-g17-reward-dynamic-huodongchangzhuhuodon] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN
org.apache.flink.configuration.Configuration
[gdc-qdata-bu/prod-g48-monitor-reward-xinzengdaojujiankong] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:48,920 WARN
org.apache.flink.configuration.Configuration
[gdc-qdata-bu/prod-g17-reward-dynamic-reward-huodongchangzh] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN
org.apache.flink.configuration.Configuration
[gdc-gdc-sa/logstream-jinghang-jinghang-opd-java-fs-log-p] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:48,919 WARN
org.apache.flink.configuration.Configuration
[gdc-gdc-sa/logstream-panama-panama-h72-hexfps-proxima-pr] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN
org.apache.flink.configuration.Configuration
[gdc-qdata-bu/prod-g17-reward-dynamic-reward-huodongchangzh] - Config uses
deprecated configuration key 'high-availability' instead of proper key
'high-availability.type' {code}
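To compare the two excerpts, the leader-election lines can be pulled out and put on one timeline. The following is only a rough sketch in Python, not part of the operator: the function names `leader_events` and `handover_seconds` are made up for illustration, the regular expressions assume the two log layouts shown above, and since the lines come from two different pods the computed gap is only as accurate as the clocks of those pods.

```python
import re
from datetime import datetime

# Timestamp format of the operator's own log lines, e.g. "2024-03-05 04:35:46,586".
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}"
# Matches both the abbreviated (i.j.o.) and the fully qualified logger name.
EVENT = re.compile(
    r"LeaderElectionManager.*?(New leader with|Stopped leading for) identity:\s*(\S*)"
)

# A few lines copied from the two excerpts above (wrapped as in this mail).
SAMPLE = '''
2024-03-05 04:35:46,586 i.j.o.LeaderElectionManager [INFO ] Stopped leading
for identity: flink-kubernetes-operator-85f6994468-cpsx9. Exiting.
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,027 INFO
io.javaoperatorsdk.operator.LeaderElectionManager - New leader with
identity:
"
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,121 INFO
io.javaoperatorsdk.operator.LeaderElectionManager - New leader with
identity: flink-kubernetes-operator-85f6994468-92xsz
"
'''


def leader_events(raw_log):
    """Return sorted (timestamp, event, identity) tuples for leader-election lines."""
    text = " ".join(raw_log.split())  # undo the line wrapping
    events = []
    # Split into one record per operator timestamp (zero-width split, Python 3.7+).
    for record in re.split(rf"(?=\b{TS}\b)", text):
        stamp = re.match(TS, record)
        found = EVENT.search(record)
        if not stamp or not found:
            continue
        when = datetime.strptime(stamp.group(0), "%Y-%m-%d %H:%M:%S,%f")
        identity = found.group(2).strip('".')  # drop the ES quote / trailing dot
        events.append((when, found.group(1), identity))
    return sorted(events)


def handover_seconds(events):
    """Seconds between the last 'Stopped leading' and the next named new leader."""
    stopped = [t for t, ev, _ in events if ev.startswith("Stopped")]
    if not stopped:
        return None
    taken = [t for t, ev, who in events
             if ev.startswith("New") and who and t >= stopped[-1]]
    return (taken[0] - stopped[-1]).total_seconds() if taken else None


if __name__ == "__main__":
    for when, event, identity in leader_events(SAMPLE):
        print(when, event, identity or "<empty>")
    print("handover gap (s):", handover_seconds(leader_events(SAMPLE)))
```

Run against the full log files instead of `SAMPLE`, this would also show whether any reconcile activity from the old leader is logged after the new leader has taken over.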
> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
> Key: FLINK-34576
> URL: https://issues.apache.org/jira/browse/FLINK-34576
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-03-05-15-13-11-032.png
>
>
> The HA mode of flink-kubernetes-operator is being used. When one of the
> flink-kubernetes-operator pods restarts, the operator switches the leader.
> However, some flinkdeployments then stay in the
> *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> "kubectl describe flinkdeployment xxx" shows the following error, but there
> are no exceptions in the flink-kubernetes-operator log.
>
> {code:java}
> Status:
> Cluster Info:
> Flink - Revision: b6d20ed @ 2023-12-20T10:01:39+01:00
> Flink - Version: 1.14.0-GDC1.6.0
> Total - Cpu: 7.0
> Total - Memory: 30064771072
> Error:
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.RuntimeException: Failed to load
> configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException:
> Failed to load
> configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed
> to load configuration","additionalMetadata":{}}]}
> Job Manager Deployment Status: READY
> Job Status:
> Job Id: cf44b5e73a1f263dd7d9f2c82be5216d
> Job Name: noah_stream_studio_1754211682_2218100380
> Savepoint Info:
> Last Periodic Savepoint Timestamp: 0
> Savepoint History:
> Start Time: 1705635107137
> State: RECONCILING
> Update Time: 1709272530741
> Lifecycle State: STABLE {code}
>
> !image-2024-03-05-15-13-11-032.png!
>
> version:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (flinkdeployment 1200+)
>
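To count how many deployments show the symptom described in the issue, a filter over the JSON output of `kubectl get flinkdeployment -A -o json` might look like the sketch below. It is an illustration only: the helper name `stuck_deployments` is made up, and the field names `status.jobStatus.state`, `status.lifecycleState` and `status.error` are assumed to be the JSON counterparts of the "Job Status / State", "Lifecycle State" and "Error" entries in the describe output above.

```python
import json
import sys


def stuck_deployments(items):
    """Return (namespace, name, error) for RECONCILING/STABLE deployments."""
    stuck = []
    for item in items:
        status = item.get("status") or {}
        job_state = (status.get("jobStatus") or {}).get("state")
        lifecycle = status.get("lifecycleState")
        # The combination reported in the issue: job RECONCILING, lifecycle STABLE.
        if job_state == "RECONCILING" and lifecycle == "STABLE":
            meta = item.get("metadata") or {}
            stuck.append((meta.get("namespace"), meta.get("name"), status.get("error")))
    return stuck


if __name__ == "__main__":
    # Usage: kubectl get flinkdeployment -A -o json | python stuck.py
    for namespace, name, error in stuck_deployments(json.load(sys.stdin).get("items", [])):
        print(f"{namespace}/{name}: {error}")
```

Running this before and after a leader switch would show whether the set of affected deployments changes with the switch.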