Prashant Bhardwaj created FLINK-39704:
-----------------------------------------
Summary: Kubernetes HA can recover a globally terminal FAILED
application job after leadership revoke/reacquire
Key: FLINK-39704
URL: https://issues.apache.org/jira/browse/FLINK-39704
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 2.2.0, 2.4.0
Reporter: Prashant Bhardwaj
Attachments: jm-ha-reanimation-repro-current-2026-05-18.log,
jm-ha-reanimation-repro-events-2026-05-18.txt,
jm-ha-reanimation-repro-pod-describe-2026-05-18.txt,
jm-ha-reanimation-repro-previous-2026-05-18.log
In a Kubernetes HA application cluster, a job that has already reached the
globally terminal FAILED state can be recovered and restarted with the same
JobID if Kubernetes leadership is revoked/reacquired immediately after the
terminal transition.
Observed with apache/flink:2.2.0 and Kubernetes HA.
*Timeline from repro:*
{noformat}
20:52:51.075 Task failure after TaskManager deletion
20:52:51.119 Job e7ce38da0a5b4651ce64453d6ffaa25b switched RUNNING -> FAILING
20:52:51.122 Job e7ce38da0a5b4651ce64453d6ffaa25b switched FAILING -> FAILED
20:52:52.615 KubernetesLeaderElector observed empty leader holder
20:52:52.616 Leadership revoked
20:52:52.618 Dispatcher reported same job as terminal SUSPENDED
20:52:52.921 DefaultExecutionPlanStore released execution plan e7ce38da0a5b4651ce64453d6ffaa25b
20:52:52.926 Same job id was retrieved from KubernetesStateHandleStore
20:52:53.035 Same StreamGraph(jobId: e7ce38da0a5b4651ce64453d6ffaa25b) was recovered
20:53:11.340 Same job switched CREATED -> RUNNING
{noformat}
*Expected:*
Once a job reaches the globally terminal FAILED state, a later leadership
revocation or close should not overwrite or mask that result with SUSPENDED,
which is only locally terminal. HA metadata should be cleaned up as for any
globally terminal job, and the same job should not be recovered.
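The expected behaviour can be sketched as a "first globally terminal result wins" rule. This is a minimal Python model for illustration only, not Flink code: TerminalResultLatch, report and cleanup_action are hypothetical names, and the "remove"/"release" strings merely stand in for permanently removing versus releasing the execution plan.

```python
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    """Subset of job statuses relevant to this report."""
    RUNNING = "RUNNING"
    FAILING = "FAILING"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    FINISHED = "FINISHED"
    SUSPENDED = "SUSPENDED"


# FAILED, CANCELED and FINISHED are globally terminal;
# SUSPENDED is only locally terminal.
GLOBALLY_TERMINAL = {JobStatus.FAILED, JobStatus.CANCELED, JobStatus.FINISHED}


class TerminalResultLatch:
    """Hypothetical helper: the first globally terminal status wins."""

    def __init__(self) -> None:
        self.result: Optional[JobStatus] = None

    def report(self, status: JobStatus) -> JobStatus:
        # Keep an already recorded globally terminal result, so a later
        # synthetic SUSPENDED cannot overwrite it.
        if self.result in GLOBALLY_TERMINAL:
            return self.result
        self.result = status
        return self.result

    def cleanup_action(self) -> str:
        # Globally terminal jobs are removed from HA storage;
        # anything else is only released and stays recoverable.
        return "remove" if self.result in GLOBALLY_TERMINAL else "release"
```

With this rule, reporting FAILED and then SUSPENDED leaves the result at FAILED and selects "remove", whereas a job that was only ever SUSPENDED selects "release".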
*Actual:*
Leadership revocation closes the running JobMaster/Dispatcher path with a
synthetic SUSPENDED result after the real FAILED result has already been
produced. The execution plan is released rather than permanently removed, so
the same JobID remains recoverable from Kubernetes HA storage and the job is
started again.
A secondary issue is also visible in the same churn window:
DefaultLeaderElectionService receives a grant while issuedLeaderSessionID is
already set and throws:
{noformat}
java.lang.IllegalStateException: The leadership should have been granted while not having the leadership acquired.
{noformat}
This crashes the JobManager entrypoint, but the reanimation has already
happened before the fatal error: the failed job was released/recovered from HA
metadata.
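The secondary issue can be illustrated with a small model of the grant/revoke bookkeeping. This is a sketch under assumptions, not the DefaultLeaderElectionService implementation: LeaderElectionModel and its method names are hypothetical, and RuntimeError stands in for the Java IllegalStateException.

```python
import uuid
from typing import Optional


class LeaderElectionModel:
    """Hypothetical stand-in for the grant/revoke bookkeeping described above."""

    def __init__(self) -> None:
        self.issued_leader_session_id: Optional[uuid.UUID] = None

    def on_grant_leadership(self, session_id: uuid.UUID) -> None:
        # Mirrors the reported precondition: a grant is only valid
        # while no leader session ID has been issued.
        if self.issued_leader_session_id is not None:
            raise RuntimeError(
                "The leadership should have been granted while not "
                "having the leadership acquired."
            )
        self.issued_leader_session_id = session_id

    def on_revoke_leadership(self) -> None:
        # A revoke clears the issued session ID, making the next grant valid.
        self.issued_leader_session_id = None
```

In this model, grant, revoke, grant succeeds, while two grants without a revoke in between raise the error, which matches the failure reported in the churn window.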
*Reproduction outline:*
1. Run a Kubernetes HA application cluster with restart-strategy.type: none.
2. Use a persistent HA storage dir.
3. Delete the TaskManager so the job reaches FAILED.
4. Immediately after observing RUNNING -> FAILING, patch the cluster leader
ConfigMap annotation holderIdentity to empty, forcing leadership loss/reacquire.
5. Observe FAILED followed by SUSPENDED/release/recovery of the same JobID.
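Step 5 could be checked mechanically once the state transitions have been extracted from the JobManager logs. The sketch below assumes the transitions are already available as (job_id, new_state) pairs in log order; find_reanimated_jobs is a hypothetical helper and log parsing is deliberately left out. The sample events are taken from the timeline above.

```python
from typing import Iterable, List, Set, Tuple

# Globally terminal job states; SUSPENDED is only locally terminal.
GLOBALLY_TERMINAL = {"FAILED", "CANCELED", "FINISHED"}


def find_reanimated_jobs(transitions: Iterable[Tuple[str, str]]) -> List[str]:
    """Return job IDs that entered RUNNING after a globally terminal state."""
    terminal_seen: Set[str] = set()
    reanimated: List[str] = []
    for job_id, new_state in transitions:
        if new_state in GLOBALLY_TERMINAL:
            terminal_seen.add(job_id)
        elif (
            new_state == "RUNNING"
            and job_id in terminal_seen
            and job_id not in reanimated
        ):
            reanimated.append(job_id)
    return reanimated


if __name__ == "__main__":
    job = "e7ce38da0a5b4651ce64453d6ffaa25b"
    # Transitions in the order they appear in the timeline of this report.
    events = [
        (job, "FAILING"),
        (job, "FAILED"),
        (job, "SUSPENDED"),
        (job, "RUNNING"),
    ]
    print(find_reanimated_jobs(events))
```

For the timeline in this report, the job ID is flagged because it reaches RUNNING after FAILED. A job that goes RUNNING, FAILING, FAILED and stays there is not flagged.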