[ https://issues.apache.org/jira/browse/FLINK-33011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rui Fan updated FLINK-33011:
----------------------------
Fix Version/s: kubernetes-operator-1.6.1
> Operator deletes HA data unexpectedly
> -------------------------------------
>
> Key: FLINK-33011
> URL: https://issues.apache.org/jira/browse/FLINK-33011
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.17.1, kubernetes-operator-1.6.0
> Environment: Flink: 1.17.1
> Flink Kubernetes Operator: 1.6.0
> Reporter: Ruibin Xing
> Assignee: Gyula Fora
> Priority: Blocker
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.7.0, kubernetes-operator-1.6.1
>
> Attachments: flink_operator_logs_0831.csv
>
>
> We encountered a problem where the operator unexpectedly deleted HA data.
> The timeline is as follows:
> 12:08 We submitted the first spec, which suspended the job with the savepoint
> upgrade mode.
> 12:08 The job was suspended and the HA data was preserved. The log showed
> that the observed job deployment status was MISSING.
> 12:10 We submitted the second spec, which deployed the job with the
> last-state upgrade mode.
> 12:10 The log showed that the operator ran a second deletion, removing both
> the Flink deployment and the HA data.
> 12:10 The job failed to start because the HA data was missing.
> According to the log, the deletion was triggered by
> https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168
> I believe this code path should only be reached when the job deployment
> status is not MISSING. However, the log entry just before the deletion shows
> that the observed status was MISSING at that moment.
> Related logs:
>
> {code:java}
> 2023-08-30 12:08:48.190 +0000 o.a.f.k.o.s.AbstractFlinkService [INFO ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
> 2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous status: MISSING
> 2023-08-30 12:10:27.533 +0000 o.a.f.k.o.l.AuditUtils [INFO ][default/pipeline-pipeline-se-3] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362 -> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365, podTemplate.metadata.labels.app.kubernetes.io~1version : 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), starting reconciliation.
> 2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA metadata.
> {code}
> A more complete log file is attached. Thanks.
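
The expectation stated in the description is that the deletion path runs only when the observed job deployment status is not MISSING. It can be sketched roughly as below. This is only an illustration of that expectation, not the operator's actual code: the class name, the enum, and the method are hypothetical, and the set of status values is a simplified assumption.

```java
// Hypothetical model of the guard described in the report. None of these
// names are taken from the flink-kubernetes-operator code base.
final class HaCleanupSketch {

    // Simplified, assumed set of deployment states for illustration only.
    enum DeploymentStatus { READY, DEPLOYING, MISSING, ERROR }

    // Reporter's expectation: the "delete deployment and HA metadata" path
    // is reached only when a deployment still exists, that is, when the
    // observed status is not MISSING.
    static boolean shouldDeleteDeploymentAndHaData(DeploymentStatus observed) {
        return observed != DeploymentStatus.MISSING;
    }

    public static void main(String[] args) {
        // Print the decision for every assumed status value.
        for (DeploymentStatus status : DeploymentStatus.values()) {
            System.out.println(status + " -> delete="
                    + shouldDeleteDeploymentAndHaData(status));
        }
    }
}
```

Under this assumed guard, a status of MISSING, as observed at 12:10:27.010 in the log above, would skip the deletion, so the HA data kept from the 12:08 suspension would still be available for the restart.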
--
This message was sent by Atlassian Jira
(v8.20.10#820010)