[ https://issues.apache.org/jira/browse/FLINK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-39165:
-----------------------------------
Labels: pull-request-available (was: )
> Concurrent FlinkDeployment and FlinkSessionJob upgrades lead to savepoint
> failure and state loss
> -------------------------------------------------------------------------------------------------
>
> Key: FLINK-39165
> URL: https://issues.apache.org/jira/browse/FLINK-39165
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Ihor Mielientiev
> Priority: Major
> Labels: pull-request-available
>
> When a FlinkDeployment (session cluster) and its associated
> FlinkSessionJob(s) are both updated simultaneously, the operator reconciles
> them concurrently. This leads to conflicting operations and non-deterministic
> behavior.
> Symptoms:
> * Updating a FlinkSessionJob triggers the operator to savepoint the running
> job and cancel it.
> * Simultaneously, updating the FlinkDeployment (e.g., image change) triggers
> the operator to restart the session cluster (delete/recreate JM and TM pods).
> * When these happen in parallel:
> ** The in-progress savepoint fails because the cluster is torn down
> underneath it.
> ** The running job's state is lost (no successful savepoint was taken).
> ** After both upgrades complete, the JobManager is running but has no active
> jobs: the session job was neither gracefully stopped nor automatically
> resubmitted (Job Not Found error).
>
> The result is non-deterministic: the outcome depends entirely on the
> scheduling order of the two controllers, which is not user-controllable.
>
> Expected behavior: when a FlinkDeployment and its FlinkSessionJob(s) are
> updated concurrently, the operator should serialize the upgrades, ensuring
> that the session job stops gracefully (with a savepoint) either before or
> after the cluster upgrade, never concurrently with it.
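
The serialization described under "Expected behavior" could be sketched roughly as follows. This is only an illustration of the idea, not the operator's actual fix and not its API: the operator itself is a Java project, and the names `ClusterUpgradeSerializer`, `lock_for`, and `run_concurrent_upgrades` are hypothetical. It also assumes both reconcilers run in the same process, so a per-cluster in-memory lock is enough to keep the savepoint/cancel sequence and the cluster restart from interleaving.

```python
import threading
from collections import defaultdict


class ClusterUpgradeSerializer:
    """Hypothetical helper: hands out one lock per session cluster so that
    every upgrade touching that cluster runs one at a time."""

    def __init__(self):
        self._guard = threading.Lock()              # protects the lock table itself
        self._locks = defaultdict(threading.Lock)   # cluster name -> upgrade lock

    def lock_for(self, cluster_name):
        with self._guard:
            return self._locks[cluster_name]


def run_concurrent_upgrades(cluster_name="session-cluster"):
    """Simulates both reconcilers firing at once and returns the event order."""
    serializer = ClusterUpgradeSerializer()
    events = []

    def upgrade_session_job():
        # Savepoint and cancel happen as one unit while the cluster lock is held.
        with serializer.lock_for(cluster_name):
            events.append("savepoint-start")
            events.append("savepoint-done")
            events.append("job-cancelled")

    def upgrade_deployment():
        # The cluster restart waits until no session-job upgrade is in flight.
        with serializer.lock_for(cluster_name):
            events.append("cluster-teardown")
            events.append("cluster-recreated")

    threads = [
        threading.Thread(target=upgrade_session_job),
        threading.Thread(target=upgrade_deployment),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Whichever thread acquires the lock first, the savepoint always completes before or after the teardown, never during it. The sketch deliberately leaves out the resubmission of the session job after the cluster comes back, which the report also lists as missing.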
--
This message was sent by Atlassian Jira
(v8.20.10#820010)