[
https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chenyuzhi updated FLINK-34576:
------------------------------
Description:
The flink-kubernetes-operator is running in HA mode. When one of the
flink-kubernetes-operator pods restarts, the operator switches leader.
However, some flinkdeployments have been in the
*JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
Running "kubectl describe flinkdeployment xxx" shows the following error,
but there are no exceptions in the flink-kubernetes-operator log.
{code:java}
Status:
  Cluster Info:
    Flink - Revision: b6d20ed @ 2023-12-20T10:01:39+01:00
    Flink - Version: 1.14.0-GDC1.6.0
    Total - Cpu: 7.0
    Total - Memory: 30064771072
  Error: {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
  Job Manager Deployment Status: READY
  Job Status:
    Job Id: cf44b5e73a1f263dd7d9f2c82be5216d
    Job Name: noah_stream_studio_1754211682_2218100380
    Savepoint Info:
      Last Periodic Savepoint Timestamp: 0
      Savepoint History:
    Start Time: 1705635107137
    State: RECONCILING
    Update Time: 1709272530741
  Lifecycle State: STABLE
{code}
!image-2024-03-05-15-13-11-032.png!
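The Error value above is a JSON document, so the exception chain can be read programmatically instead of by eye. The following is a minimal Python sketch, not part of the operator: the helper names exception_chain and root_cause are hypothetical, and it assumes that throwableList is ordered from the outermost wrapper to the innermost cause, which matches the sample above but is not confirmed by this report.

```python
import json

GUAVA_EX = (
    "org.apache.flink.shaded.guava30.com.google.common.util.concurrent."
    "UncheckedExecutionException"
)

# The Error value from the status above, rebuilt as a single JSON string.
STATUS_ERROR = json.dumps({
    "type": "org.apache.flink.kubernetes.operator.exception.ReconciliationException",
    "message": GUAVA_EX + ": java.lang.RuntimeException: Failed to load configuration",
    "additionalMetadata": {},
    "throwableList": [
        {
            "type": GUAVA_EX,
            "message": "java.lang.RuntimeException: Failed to load configuration",
            "additionalMetadata": {},
        },
        {
            "type": "java.lang.RuntimeException",
            "message": "Failed to load configuration",
            "additionalMetadata": {},
        },
    ],
})


def exception_chain(error_json):
    """Return [(type, message), ...] starting with the top-level exception."""
    err = json.loads(error_json)
    chain = [(err.get("type"), err.get("message"))]
    for throwable in err.get("throwableList", []):
        chain.append((throwable.get("type"), throwable.get("message")))
    return chain


def root_cause(error_json):
    """Return the last chain entry (assumed to be the innermost cause)."""
    return exception_chain(error_json)[-1]


if __name__ == "__main__":
    # Print the chain with one extra indent level per nested cause.
    for depth, (exc_type, message) in enumerate(exception_chain(STATUS_ERROR)):
        print("  " * depth + f"{exc_type}: {message}")
```

For the status above, the innermost entry is the java.lang.RuntimeException with the message "Failed to load configuration"; the JSON carries no stack trace, so the sketch cannot show where the configuration load failed.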
version:
flink-kubernetes-operator: 1.6.1
flink: 1.14.0/1.15.2
Job scale:
flinkdeployment 1200+
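With this many flinkdeployments, checking each one with "kubectl describe" is impractical. Below is a minimal Python sketch for finding the affected resources in the JSON produced by "kubectl get flinkdeployment -A -o json". The function name stuck_deployments and the sample data are hypothetical, and the field names status.jobStatus.state, status.lifecycleState and status.error are inferred from the describe output above rather than taken from this report.

```python
def stuck_deployments(listing):
    """Return (namespace, name, error) for every RECONCILING/STABLE resource.

    `listing` is the parsed JSON of a Kubernetes list object. The status field
    names are assumptions inferred from the `kubectl describe` output.
    """
    found = []
    for item in listing.get("items", []):
        status = item.get("status") or {}
        job_status = status.get("jobStatus") or {}
        if (job_status.get("state") == "RECONCILING"
                and status.get("lifecycleState") == "STABLE"):
            meta = item.get("metadata") or {}
            found.append((meta.get("namespace"), meta.get("name"), status.get("error")))
    return found


# Hypothetical sample data; real input would come from kubectl.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "ns-a", "name": "job-1"},
            "status": {
                "jobStatus": {"state": "RECONCILING"},
                "lifecycleState": "STABLE",
                "error": "Failed to load configuration",
            },
        },
        {
            "metadata": {"namespace": "ns-a", "name": "job-2"},
            "status": {"jobStatus": {"state": "RUNNING"}, "lifecycleState": "STABLE"},
        },
        # A resource without a status yet is skipped.
        {"metadata": {"namespace": "ns-b", "name": "job-3"}},
    ]
}

if __name__ == "__main__":
    for namespace, name, error in stuck_deployments(SAMPLE):
        print(f"{namespace}/{name}: {error}")
```

Against a real cluster, the same function could be fed json.load(sys.stdin) with the kubectl output piped in.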
[~gyfora]
was:
The flink-kubernetes-operator is running in HA mode. When one of the
flink-kubernetes-operator pods restarts, the operator switches leader.
However, some flinkdeployments have been in the
*JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
Running "kubectl describe flinkdeployment xxx" shows the following error,
but there are no exceptions in the flink-kubernetes-operator log.
{code:java}
Status:
  Cluster Info:
    Flink - Revision: b6d20ed @ 2023-12-20T10:01:39+01:00
    Flink - Version: 1.14.0-GDC1.6.0
    Total - Cpu: 7.0
    Total - Memory: 30064771072
  Error: {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
  Job Manager Deployment Status: READY
  Job Status:
    Job Id: cf44b5e73a1f263dd7d9f2c82be5216d
    Job Name: noah_stream_studio_1754211682_2218100380
    Savepoint Info:
      Last Periodic Savepoint Timestamp: 0
      Savepoint History:
    Start Time: 1705635107137
    State: RECONCILING
    Update Time: 1709272530741
  Lifecycle State: STABLE
{code}
!image-2024-03-05-15-13-11-032.png!
version:
flink-kubernetes-operator: 1.6.1
flink: 1.14.0/1.15.2
[~gyfora]
> Flink deployment keeps staying at RECONCILING/STABLE status
> -----------------------------------------------------------
>
> Key: FLINK-34576
> URL: https://issues.apache.org/jira/browse/FLINK-34576
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-03-05-15-13-11-032.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)