Ruibin Xing created FLINK-33011:
-----------------------------------
Summary: Operator deletes HA data unexpectedly
Key: FLINK-33011
URL: https://issues.apache.org/jira/browse/FLINK-33011
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.0, 1.17.1
Environment: Flink: 1.17.1
Flink Kubernetes Operator: 1.6.0
Reporter: Ruibin Xing
Attachments: flink_operator_logs_0831.csv
We encountered a problem where the operator unexpectedly deleted HA data.
The timeline is as follows:
12:08 We submitted the first spec, which suspended the job with savepoint
upgrade mode.
12:08 The job was suspended, while the HA data was preserved, and the log
showed the observed job deployment status was MISSING.
12:10 We submitted the second spec, which deployed the job with the last state
upgrade mode.
12:10 The logs showed that the operator deleted the Flink deployment again,
this time together with the HA data.
12:10 The job failed to start because the HA data was missing.
According to the log, the deletion was triggered by
https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168
As far as I can tell, this code path should only run when the JobManager
deployment status is not MISSING. However, the log line just before the
deletion shows that the observed status was MISSING at that moment.
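To make the expectation concrete, here is a minimal sketch of the guard as I understand it. It is not the operator's actual code: the class, enum, and method names are hypothetical, and the status values are a simplified stand-in for the real deployment status. It only models the assumption that the pre-deploy deletion is skipped when the observed status is MISSING.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these names are the operator's real API.
class HaCleanupSketch {

    // Simplified stand-in for the observed JobManager deployment status.
    enum DeploymentStatus { READY, DEPLOYING, MISSING, ERROR }

    // Assumed expectation from this report: the pre-deploy cleanup only runs
    // when a JobManager deployment is still observed (status is not MISSING).
    static boolean shouldCleanUpBeforeDeploy(DeploymentStatus observed) {
        return observed != DeploymentStatus.MISSING;
    }

    // Returns the ordered list of actions a deploy would take under that assumption.
    static List<String> planDeploy(DeploymentStatus observed) {
        List<String> actions = new ArrayList<>();
        if (shouldCleanUpBeforeDeploy(observed)) {
            actions.add("delete-deployment-and-ha-metadata");
        }
        actions.add("deploy");
        return actions;
    }

    public static void main(String[] args) {
        // Status observed at 12:10 in this report.
        System.out.println(planDeploy(DeploymentStatus.MISSING));
        // A deployment that is still present.
        System.out.println(planDeploy(DeploymentStatus.READY));
    }
}
```

Under this reading, a deploy that starts from MISSING should never reach the deletion step, which is why the "Deleting JobManager deployment and HA metadata" log line is surprising.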
Related logs:
{code:java}
2023-08-30 12:08:48.190 +0000 o.a.f.k.o.s.AbstractFlinkService [INFO ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous status: MISSING
2023-08-30 12:10:27.533 +0000 o.a.f.k.o.l.AuditUtils [INFO ][default/pipeline-pipeline-se-3] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362 -> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365, podTemplate.metadata.labels.app.kubernetes.io~1version : 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), starting reconciliation.
2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA metadata.
{code}
A more complete log file is attached. Thanks.