[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #165: [FLINK-26140] Support rollback strategies

GitBox Thu, 14 Apr 2022 03:21:08 -0700


gyfora commented on code in PR #165:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/165#discussion_r850305005



##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##########
@@ -102,36 +111,84 @@ public void reconcile(FlinkDeployment flinkApp, Context 
context, Configuration e
             }
             if (currentJobState == JobState.SUSPENDED && desiredJobState == 
JobState.RUNNING) {
                 if (upgradeMode == UpgradeMode.STATELESS) {
-                    deployFlinkJob(flinkApp, effectiveConfig, 
Optional.empty());
-                } else if (upgradeMode == UpgradeMode.LAST_STATE
-                        || upgradeMode == UpgradeMode.SAVEPOINT) {
-                    restoreFromLastSavepoint(flinkApp, effectiveConfig);
+                    deployFlinkJob(currentJobSpec, status, effectiveConfig, 
Optional.empty());
+                } else {
+                    restoreFromLastSavepoint(currentJobSpec, status, 
effectiveConfig);
                 }
                 stateAfterReconcile = JobState.RUNNING;
             }
-            IngressUtils.updateIngressRules(flinkApp, effectiveConfig, 
kubernetesClient);
+            IngressUtils.updateIngressRules(
+                    deployMeta, currentDeploySpec, effectiveConfig, 
kubernetesClient);
             ReconciliationUtils.updateForSpecReconciliationSuccess(flinkApp, 
stateAfterReconcile);
-        } else if (SavepointUtils.shouldTriggerSavepoint(flinkApp) && 
isJobRunning(flinkApp)) {
+        } else if (ReconciliationUtils.shouldRollBack(reconciliationStatus, 
effectiveConfig)) {
+            rollbackApplication(flinkApp);
+        } else if (SavepointUtils.shouldTriggerSavepoint(currentJobSpec, 
status)
+                && isJobRunning(status)) {
             triggerSavepoint(flinkApp, effectiveConfig);
             ReconciliationUtils.updateSavepointReconciliationSuccess(flinkApp);
+        } else {
+            LOG.info("Deployment is fully reconciled, nothing to do.");
         }
     }
 
+    private void rollbackApplication(FlinkDeployment flinkApp) throws 
Exception {
+        ReconciliationStatus reconciliationStatus = 
flinkApp.getStatus().getReconciliationStatus();
+
+        if (reconciliationStatus.getState() != 
ReconciliationStatus.State.ROLLING_BACK) {
+            LOG.warn("Preparing to roll back to last stable spec.");
+            if (flinkApp.getStatus().getError() == null) {
+                flinkApp.getStatus()
+                        .setError(
+                                "Deployment is not ready within the configured 
timeout, rolling-back.");
+            }
+            
reconciliationStatus.setState(ReconciliationStatus.State.ROLLING_BACK);
+            return;
+        }
+
+        LOG.warn("Executing roll-back operation");
+
+        FlinkDeploymentSpec rollbackSpec = 
reconciliationStatus.deserializeLastStableSpec();
+        Configuration rollbackConfig =
+                FlinkUtils.getEffectiveConfig(flinkApp.getMetadata(), 
rollbackSpec, defaultConfig);
+
+        UpgradeMode upgradeMode = flinkApp.getSpec().getJob().getUpgradeMode();
+
+        suspendJob(

Review Comment:
   I am afraid it is not that simple and it is by design like this. Let me try 
to explain.
   
   I think we earlier decided against making changes to the spec itself as it 
breaks the idempotency of the operator logic. If we simply change the spec then 
an external process might simply reapply what was previous rolled back causing 
a loop.
   
   This is also the reason why we do not upgrade the `lastReconciledSpec`. That 
field shows the last spec that the user have sent in that was reconciled by the 
controller. Even if it was rolled back later that does not change.
   
   The ReconciliationStatus.State shows what actually happened to the 
lastReconciledSpec: DEPLOYED means it is running on the cluster, 
ROLLING_BACK/ROLLED_BACK actually signal that it was reverted after it was 
DEPLOYED.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #165: [FLINK-26140] Support rollback strategies

Reply via email to