[jira] [Updated] (FLINK-39165) Concurrent FlinkDeployment and FlinkSessionJob upgrades lead to savepoint failure and state loss

ASF GitHub Bot (Jira) Thu, 05 Mar 2026 09:43:07 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated FLINK-39165:
-----------------------------------
    Labels: pull-request-available  (was: )

> Concurrent FlinkDeployment and FlinkSessionJob upgrades lead to savepoint 
> failure and state loss 
> -------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39165
>                 URL: https://issues.apache.org/jira/browse/FLINK-39165
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Ihor Mielientiev
>            Priority: Major
>              Labels: pull-request-available
>
> When a FlinkDeployment (session cluster) and its associated 
> FlinkSessionJob(s) are both updated simultaneously, the operator reconciles 
> them concurrently. This leads to conflicting operations and non-deterministic 
> behavior.
> Symptoms:
>  * Updating a FlinkSessionJob triggers the operator to savepoint the running 
> job and cancel it.
>  * Simultaneously, updating the FlinkDeployment (e.g., an image change) 
> triggers the operator to restart the session cluster (delete/recreate JM 
> and TM pods).
>  * When these happen in parallel:
>  ** The in-progress savepoint fails because the cluster is torn down 
> underneath it.
>  ** The running job's state is lost (no successful savepoint was taken).
>  ** After both upgrades complete, the JobManager is running but has no active 
> jobs: the session job was neither gracefully stopped nor automatically 
> resubmitted (Job Not Found error).
>  
> The result is non-deterministic: the outcome depends entirely on the 
> scheduling order of the two controllers, which is not user-controllable.
>  
> Expected behavior: when a FlinkDeployment and its FlinkSessionJob(s) are 
> updated concurrently, the operator should serialize the upgrades, ensuring 
> that the session job stops gracefully (with a savepoint) either before or 
> after the cluster upgrade, never concurrently with it.
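
The serialization described under "Expected behavior" could be sketched roughly as follows. This is a minimal illustration in Python, not the operator's actual code and not the fix proposed in the linked pull request. The names ClusterSerializer, upgrade_session_job, upgrade_deployment and run_concurrent_upgrades are hypothetical, and the "upgrades" only append step names to a list. The assumption is that both reconcile paths take the same lock, keyed by session cluster name, so the savepoint/cancel sequence and the cluster restart can never interleave.

```python
import threading
from collections import defaultdict


class ClusterSerializer:
    """Hypothetical helper: hands out one lock per session cluster name."""

    def __init__(self):
        self._guard = threading.Lock()              # protects the lock table itself
        self._locks = defaultdict(threading.Lock)   # cluster name -> lock

    def lock_for(self, cluster_name):
        with self._guard:
            return self._locks[cluster_name]


def upgrade_session_job(serializer, cluster, job, log):
    # Stand-in for the FlinkSessionJob upgrade: savepoint, then cancel.
    with serializer.lock_for(cluster):
        log.append("savepoint:" + job)
        log.append("cancel:" + job)


def upgrade_deployment(serializer, cluster, log):
    # Stand-in for the FlinkDeployment upgrade: restart the session cluster.
    with serializer.lock_for(cluster):
        log.append("cluster-restart:begin")
        log.append("cluster-restart:end")


def run_concurrent_upgrades():
    """Start both upgrades at once and return the recorded step order."""
    serializer = ClusterSerializer()
    log = []
    threads = [
        threading.Thread(target=upgrade_session_job,
                         args=(serializer, "session-a", "job-1", log)),
        threading.Thread(target=upgrade_deployment,
                         args=(serializer, "session-a", log)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_concurrent_upgrades())
```

Which upgrade runs first still depends on thread scheduling, but each one runs to completion before the other starts, which matches the "before or after, not simultaneously" requirement. A real fix would additionally have to decide how the session job is resubmitted once the cluster is back.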



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
