[
https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-32012.
------------------------------
Fix Version/s: kubernetes-operator-1.6.0
Resolution: Fixed
merged to main d346ca9c437d20042ed8f4a1954f0f0ed438b3ae
> Operator failed to rollback due to missing HA metadata
> ------------------------------------------------------
>
> Key: FLINK-32012
> URL: https://issues.apache.org/jira/browse/FLINK-32012
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.4.0
> Reporter: Nicolas Fraison
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.6.0
>
>
> The operator correctly detected that the job was failing and initiated a
> rollback, but the rollback failed with `Rollback is not possible due to
> missing HA metadata`.
> We rely on savepoint upgrade mode and ZooKeeper HA.
> In savepoint upgrade mode, the operator performs a sequence of actions that
> also deletes this HA data:
> * [flink-kubernetes-operator/AbstractFlinkService.java at main ·
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
> : Suspend job with savepoint and deleteClusterDeployment
> * [flink-kubernetes-operator/StandaloneFlinkService.java at main ·
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
> : Remove JM + TM deployment and delete HA data
> * [flink-kubernetes-operator/AbstractFlinkService.java at main ·
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
> : Wait for cluster shutdown and delete ZooKeeper HA data
> * [flink-kubernetes-operator/FlinkUtils.java at main ·
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
> : Remove all child znodes
> Then, when running the rollback, the operator looks for HA data even though
> we rely on savepoint upgrade mode:
> * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main ·
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
> : Reconcile the rollback if a rollback should be performed
> * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main ·
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
> : Rollback fails because HA data is not available
> * [flink-kubernetes-operator/FlinkUtils.java at main ·
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]
> : Check whether any child znodes are available
> For both steps the pattern looks the same for Kubernetes HA, so this does not
> appear to be linked to a bug specific to ZooKeeper.
>
> From https://issues.apache.org/jira/browse/FLINK-30305 it appears to be
> expected that the HA data has been deleted (Flink itself also deletes it
> when relying on savepoint upgrade mode).
> Still, the use case seems to differ from
> https://issues.apache.org/jira/browse/FLINK-30305, as here the operator is
> aware of the failure and is handling a specific rollback event.
> So I'm wondering why we enforce such a check when performing rollback if we
> rely on savepoint upgrade mode. Would it be fine to not rely on the HA data
> and rollback from the last savepoint (the one we used in the deployment step)?
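The question raised in the description can be illustrated with a minimal sketch. This is not the operator's code and not the merged fix: `RollbackSketch`, `HaStore`, `canRollbackCurrent` and `canRollbackProposed` are hypothetical names, the HA store is an in-memory stand-in for ZooKeeper znodes or Kubernetes HA data, and the savepoint path is made up. It only models the reported sequence (HA data cleared during a savepoint upgrade, then a rollback check that requires HA data) next to the proposed alternative (accept the last savepoint when in savepoint upgrade mode).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model for illustration only; none of these names
// exist in flink-kubernetes-operator.
class RollbackSketch {

    enum UpgradeMode { SAVEPOINT, LAST_STATE, STATELESS }

    /** In-memory stand-in for the HA store (child znodes or Kubernetes HA data). */
    static final class HaStore {
        private final Map<String, String> children = new HashMap<>();

        void put(String key, String value) {
            children.put(key, value);
        }

        /** Models "remove all child znodes" during the savepoint upgrade. */
        void deleteAllChildren() {
            children.clear();
        }

        /** Models "check whether any child znodes are available". */
        boolean hasMetadata() {
            return !children.isEmpty();
        }
    }

    /** Behaviour described in the report: rollback always requires HA metadata. */
    static boolean canRollbackCurrent(HaStore ha) {
        return ha.hasMetadata();
    }

    /**
     * Proposed alternative (an assumption, not the merged fix): in savepoint
     * upgrade mode a known last savepoint is sufficient to roll back.
     */
    static boolean canRollbackProposed(
            UpgradeMode mode, HaStore ha, Optional<String> lastSavepoint) {
        if (mode == UpgradeMode.SAVEPOINT && lastSavepoint.isPresent()) {
            return true;
        }
        return ha.hasMetadata();
    }

    public static void main(String[] args) {
        HaStore ha = new HaStore();
        ha.put("jobgraph", "placeholder");

        // Savepoint upgrade: a savepoint is taken, then HA data is deleted.
        Optional<String> lastSavepoint = Optional.of("s3://example-bucket/savepoint-1");
        ha.deleteAllChildren();

        System.out.println("current check allows rollback:  " + canRollbackCurrent(ha));
        System.out.println("proposed check allows rollback: "
                + canRollbackProposed(UpgradeMode.SAVEPOINT, ha, lastSavepoint));
    }
}
```

In this model the current check rejects the rollback once the HA data has been cleared, while the proposed check accepts it because a savepoint from the deployment step is still known. For upgrade modes other than savepoint, the proposed check falls back to requiring HA metadata.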