[
https://issues.apache.org/jira/browse/FLINK-32520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740009#comment-17740009
]
Ruibin Xing commented on FLINK-32520:
-------------------------------------
[~gyfora] Sorry, after reviewing, it turns out that the second and third
upgrades were {{last-state}} due to a bug in our web UI. Still, the third
upgrade did use the savepoint from the first upgrade. Since I can't provide
more information right now, I will close this issue. I'm going to change the
logger settings and try to reproduce the issue. Thanks for your time.
> FlinkDeployment recovered states from an obsolete savepoint when performing
> an upgrade
> --------------------------------------------------------------------------------------
>
> Key: FLINK-32520
> URL: https://issues.apache.org/jira/browse/FLINK-32520
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.13.1
> Reporter: Ruibin Xing
> Priority: Major
> Attachments: flink_kubernetes_operator_0615.csv,
> logs-06151328-06151332.csv
>
>
> Kubernetes Operator version: 1.5.0
>
> When upgrading one of our Flink jobs, it recovered from a savepoint created
> by the previous version of the job. The timeline of the job is as follows:
> # I upgraded the job for the first time. The job created a savepoint and
> successfully restored from it.
> # The job was running fine and created several checkpoints.
> # Later, I performed the second upgrade. Soon after submission and before
> the JobManager stopped, I realized I made a mistake in the spec, so I quickly
> did the third upgrade.
> # After the job started, I found that it had recovered from the savepoint
> created during the first upgrade.
>
> It appears that there was an error when submitting the third upgrade.
> However, even after investigating the code, I'm still not sure why this
> would cause Flink to use the obsolete savepoint. The related operator logs
> are attached below.
>
> Although I haven't found the root cause, I came up with some possible fixes:
> # Remove the {{lastSavepoint}} after a job has successfully restored from it.
> # Add a savepoint age option, similar to:
> {{kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age}} The
> operator would refuse to recover from the savepoint if the maximum age is
> exceeded.
> # Add a flag in the status that records the savepoint state. Set the flag to
> false when the savepoint operation starts and mark it as true when it
> completes successfully. The job should report an error if the flag for the
> last savepoint is false.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)