[
https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824029#comment-17824029
]
chenyuzhi commented on FLINK-34576:
-----------------------------------
According to this JOSDK
[issue|https://github.com/operator-framework/java-operator-sdk/issues/2009],
is it possible that during the stopLeader process the old leader cannot exit
within the leaseDuration-renewDuration time window, so that the old leader and
the new leader update the status of the flinkdeployment at the same time,
resulting in a conflict? JOSDK versions before 4.5 cannot tell whether the
running instance is actually leading or not.
I think we can write a test case for this situation, but I am not very
familiar with the operator. There seem to be relatively few unit tests
covering HA. If I can get some guidance, I would like to try it.
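To make the suspected interleaving concrete, here is a minimal sketch of it in plain Java. It does not use JOSDK or the operator code: Status, Store, Instance and oldLeaderOutcome are hypothetical names made up for illustration, and the store only mimics the way the Kubernetes API server rejects an update based on a stale resourceVersion. It is a way to reason about the race under those assumptions, not a confirmed reproduction of this issue.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: none of these types exist in JOSDK or the Flink operator. */
final class StaleLeaderSketch {

    /** Stand-in for a FlinkDeployment status together with its resourceVersion. */
    static final class Status {
        final long resourceVersion;
        final String writer;

        Status(long resourceVersion, String writer) {
            this.resourceVersion = resourceVersion;
            this.writer = writer;
        }
    }

    /** In-memory store mimicking optimistic locking: a write based on a stale read is rejected. */
    static final class Store {
        private final AtomicReference<Status> current =
                new AtomicReference<>(new Status(0, "none"));

        Status read() {
            return current.get();
        }

        boolean update(Status seen, String writer) {
            // compareAndSet compares references, so only a write based on the latest read succeeds.
            return current.compareAndSet(seen, new Status(seen.resourceVersion + 1, writer));
        }
    }

    /** An operator instance; checkLeadership models an instance that can tell whether it leads. */
    static final class Instance {
        final String name;
        final boolean checkLeadership;
        // Ground truth of who holds the lease; ignored when checkLeadership is false.
        final AtomicBoolean leading = new AtomicBoolean(false);

        Instance(String name, boolean checkLeadership) {
            this.name = name;
            this.checkLeadership = checkLeadership;
        }

        /** Returns "SKIPPED", "UPDATED" or "CONFLICT". */
        String writeStatus(Store store, Status seen) {
            if (checkLeadership && !leading.get()) {
                return "SKIPPED";
            }
            return store.update(seen, name) ? "UPDATED" : "CONFLICT";
        }
    }

    /** Replays the suspected interleaving and returns the outcome of the old leader's late write. */
    static String oldLeaderOutcome(boolean checkLeadership) {
        Store store = new Store();
        Instance oldLeader = new Instance("old", checkLeadership);
        Instance newLeader = new Instance("new", checkLeadership);

        oldLeader.leading.set(true);
        Status seenByOld = store.read();            // old leader starts a reconcile

        oldLeader.leading.set(false);               // lease moves while the reconcile is in flight
        newLeader.leading.set(true);
        newLeader.writeStatus(store, store.read()); // new leader updates the status first

        return oldLeader.writeStatus(store, seenByOld); // old leader finishes late
    }

    public static void main(String[] args) {
        System.out.println("without leadership check: " + oldLeaderOutcome(false));
        System.out.println("with leadership check:    " + oldLeaderOutcome(true));
    }
}
```

In this model the old leader's late write ends in a conflict when it cannot tell that it lost the lease, and is skipped when it can. A real test would have to drive the operator's actual leader election instead of these stand-ins.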
> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
> Key: FLINK-34576
> URL: https://issues.apache.org/jira/browse/FLINK-34576
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-03-05-15-13-11-032.png
>
>
> The HA mode of flink-kubernetes-operator is being used. When one of the
> flink-kubernetes-operator pods restarts, the operator switches leaders.
> However, some flinkdeployments then stay in the
> *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> Running "kubectl describe flinkdeployment xxx" shows the following error,
> but there are no exceptions in the flink-kubernetes-operator log.
>
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision: b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version: 1.14.0-GDC1.6.0
>     Total - Cpu: 7.0
>     Total - Memory: 30064771072
>   Error: {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status: READY
>   Job Status:
>     Job Id: cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name: noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp: 0
>       Savepoint History:
>     Start Time: 1705635107137
>     State: RECONCILING
>     Update Time: 1709272530741
>   Lifecycle State: STABLE
> {code}
>
> !image-2024-03-05-15-13-11-032.png!
>
> version:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (flinkdeployment 1200+)
>