[ 
https://issues.apache.org/jira/browse/FLINK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516578#comment-17516578
 ] 

Gyula Fora commented on FLINK-26140:
------------------------------------

One straightforward way to implement this would be to add a new field to the 
status called *lastStableSpec*.

lastStableSpec would be somewhat similar to lastReconciledSpec but while 
lastReconciledSpec is updated by the reconciler, lastStableSpec should be 
updated by the observer based on some stability condition.

We could start with a simple checkpoint condition, where a spec is marked 
stable once the resulting job run has completed at least one successful 
checkpoint or savepoint. Later we can add user-configurable stability 
conditions, but this is a good start.
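
To make the split between the two fields concrete, here is a rough sketch of the observer-side update. All names here (StableSpecSketch, Status, isStable, observe) are hypothetical stand-ins rather than the operator's actual classes, and specs are reduced to plain strings for brevity.

```java
import java.util.Objects;

/** Hypothetical, simplified stand-ins; not the operator's actual classes. */
public class StableSpecSketch {

    /** Minimal status holding both spec snapshots (specs are plain strings here). */
    public static class Status {
        public String lastReconciledSpec; // written by the reconciler after a deployment
        public String lastStableSpec;     // written by the observer once the job looks stable
    }

    /** Stability condition proposed above: at least one completed checkpoint/savepoint. */
    public static boolean isStable(long completedCheckpoints) {
        return completedCheckpoints >= 1;
    }

    /** Observer step: promote the reconciled spec to stable when the condition holds. */
    public static void observe(Status status, long completedCheckpoints) {
        if (status.lastReconciledSpec != null
                && isStable(completedCheckpoints)
                && !Objects.equals(status.lastReconciledSpec, status.lastStableSpec)) {
            status.lastStableSpec = status.lastReconciledSpec;
        }
    }
}
```

Keeping the condition in its own method (isStable) is one way to leave room for the user-configurable conditions later.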

Once we have the *lastStableSpec* field working, we could introduce the 
rollback strategy. If rollback is enabled, then on any deployment error (as 
opposed to a reconciliation error) the job would be rolled back to the 
*lastStableSpec*. For executing the rollback we can reuse the logic from the 
reconciler with some slight modifications.
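
As a rough illustration of the rollback decision only (not of the redeployment itself), something like the following could work. The names RollbackSketch, ErrorKind and rollbackTarget are made up for this sketch, and skipping the rollback when the failed spec is already the stable one is my own assumption.

```java
import java.util.Optional;

/** Hypothetical sketch of the rollback decision; not the operator's actual API. */
public class RollbackSketch {

    /** The two error classes distinguished above. */
    public enum ErrorKind { DEPLOYMENT, RECONCILIATION }

    /**
     * Returns the spec to roll back to, or empty if no rollback should happen.
     * Specs are plain strings here for brevity.
     */
    public static Optional<String> rollbackTarget(
            boolean rollbackEnabled,
            ErrorKind error,
            String lastReconciledSpec,
            String lastStableSpec) {
        if (!rollbackEnabled || error != ErrorKind.DEPLOYMENT) {
            return Optional.empty(); // reconciliation errors never trigger a rollback
        }
        if (lastStableSpec == null || lastStableSpec.equals(lastReconciledSpec)) {
            return Optional.empty(); // assumption: nothing older and stable to return to
        }
        return Optional.of(lastStableSpec);
    }
}
```

The returned spec would then be handed to the (slightly modified) reconciler logic to perform the actual redeployment.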

[~wangyang0918] [~aitozi] wdyt?

> Add basic handling mechanism to deal with job upgrade errors
> ------------------------------------------------------------
>
>                 Key: FLINK-26140
>                 URL: https://issues.apache.org/jira/browse/FLINK-26140
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>             Fix For: kubernetes-operator-1.0.0
>
>
> There are various different ways how a stateful job upgrade can fail.
> For example:
> - Failure/timeout during savepoint
> - Incompatible state
> - Corrupted / not-found checkpoint
> - Error after restart
> We should allow some strategies for the user to declare how to handle the 
> different error scenarios (such as roll back to earlier state) and what 
> should be treated as a fatal error.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
