[jira] [Comment Edited] (FLINK-39704) Kubernetes HA can recover a globally terminal FAILED application job after leadership revoke/reacquire

Prashant Bhardwaj (Jira) Tue, 19 May 2026 07:25:01 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-39704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082065#comment-18082065
 ]


Prashant Bhardwaj edited comment on FLINK-39704 at 5/19/26 2:22 PM:
--------------------------------------------------------------------

This is not specific to K8s or to Application clusters. 
{{ZooKeeperLeaderElectionDriver.notLeader()}} feeds the same 
{{onRevokeLeadership}} path into {{JobMasterServiceLeadershipRunner}}, so a 
Curator session expiry (or any latch loss) during result-future completion can 
hit the same race. I'll update the description. I'm also happy to update the 
title to reflect that this hits ZK HA as well; let me know if you'd prefer a 
specific wording.
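
To make the interleaving concrete, here is a minimal toy model of the race. It is only a sketch: {{ToyRunner}}, {{jobReachedTerminalState}}, {{publishResult}} and the two-value {{Status}} enum are hypothetical names invented for illustration, not Flink classes. Only the name {{onRevokeLeadership}} is borrowed from the comment above. It assumes the job's terminal status is recorded first and published to the result future in a later step, so a revoke arriving in between completes the future with a synthetic SUSPENDED.

```java
import java.util.concurrent.CompletableFuture;

/** Toy model of the revoke-vs-result race; none of these types are Flink classes. */
class RevokeRaceSketch {

    enum Status { FAILED, SUSPENDED }

    /** Hypothetical stand-in for a leadership-aware job runner. */
    static class ToyRunner {
        final CompletableFuture<Status> resultFuture = new CompletableFuture<>();
        private Status reachedByJob; // set when the job itself terminates

        // Step 1: the job reaches a terminal state, but nothing is published yet.
        void jobReachedTerminalState(Status status) {
            reachedByJob = status;
        }

        // Step 2 (normal path): publish the job's own terminal status.
        void publishResult() {
            resultFuture.complete(reachedByJob);
        }

        // Leadership loss: this unguarded runner reports a synthetic SUSPENDED.
        void onRevokeLeadership() {
            resultFuture.complete(Status.SUSPENDED);
        }
    }

    /** Replays the interleaving: terminal state, then revoke, then the late publish. */
    static Status replayRace() {
        ToyRunner runner = new ToyRunner();
        runner.jobReachedTerminalState(Status.FAILED);
        runner.onRevokeLeadership(); // completes the future first
        runner.publishResult();      // no effect: the future is already completed
        return runner.resultFuture.join();
    }

    public static void main(String[] args) {
        System.out.println("Observed result: " + replayRace());
    }
}
```

Because {{CompletableFuture.complete}} only takes effect on the first call, whichever side completes the future first decides what downstream consumers see. In this replay that is the revoke path, so the real FAILED result is masked.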


was (Author: prbhardwaj):
It's not K8s-specific or Application clusters specific. 
{{ZooKeeperLeaderElectionDriver.notLeader()}} feeds the same 
{{onRevokeLeadership}} path into {{JobMasterServiceLeadershipRunner}}, so a 
Curator session expiry (or any latch loss) during result-future completion 
produces the identical race. I'll update the description, happy to update the 
title too to reflect that this hits ZK HA as well, let me know if you'd prefer 
a specific wording.

> Kubernetes HA can recover a globally terminal FAILED application job after 
> leadership revoke/reacquire
> ------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39704
>                 URL: https://issues.apache.org/jira/browse/FLINK-39704
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 2.2.0, 2.4.0
>            Reporter: Prashant Bhardwaj
>            Priority: Major
>         Attachments: flink-config-ha-reanimation-repro-cm-2026-05-19.yaml, 
> jm-ha-reanimation-repro-current-2026-05-18.log, 
> jm-ha-reanimation-repro-events-2026-05-18.txt, 
> jm-ha-reanimation-repro-pod-describe-2026-05-18.txt, 
> jm-ha-reanimation-repro-previous-2026-05-18.log
>
>
> In a Kubernetes HA application cluster, a job that has already reached the 
> globally terminal FAILED state can be recovered and restarted with the same 
> JobID if Kubernetes leadership is revoked/reacquired immediately after the 
> terminal transition.
> Observed with apache/flink:2.2.0 and Kubernetes HA.
> *Timeline from repro:*
> {noformat}
> 20:52:51.075  Task failure after TaskManager deletion
> 20:52:51.119  Job e7ce38da0a5b4651ce64453d6ffaa25b switched RUNNING -> FAILING
> 20:52:51.122  Job e7ce38da0a5b4651ce64453d6ffaa25b switched FAILING -> FAILED
> 20:52:52.615  KubernetesLeaderElector observed empty leader holder
> 20:52:52.616  Leadership revoked
> 20:52:52.618  Dispatcher reported same job as terminal SUSPENDED
> 20:52:52.921  DefaultExecutionPlanStore released execution plan e7ce38da0a5b4651ce64453d6ffaa25b
> 20:52:52.926  Same job id was retrieved from KubernetesStateHandleStore
> 20:52:53.035  Same StreamGraph(jobId: e7ce38da0a5b4651ce64453d6ffaa25b) was recovered
> 20:53:11.340  Same job switched CREATED -> RUNNING
> {noformat}
> *Expected:*
> Once a job reaches globally terminal FAILED, later leadership 
> revocation/close should not overwrite or mask the globally terminal result as 
> SUSPENDED. HA metadata should be cleaned up as a globally terminal job, and 
> the same job should not be recovered.
> *Actual:*
> Leadership revocation closes the running JobMaster/Dispatcher path with 
> synthetic SUSPENDED after the real FAILED result. The execution plan is 
> released rather than permanently removed, so the same job id remains 
> recoverable from Kubernetes HA storage and is started again.
> A secondary issue is also visible in the same churn window:
> DefaultLeaderElectionService receives a grant while issuedLeaderSessionID is 
> already set and throws:
> java.lang.IllegalStateException: The leadership should have been granted while not having the leadership acquired.
> This crashes the JobManager entrypoint, but the reanimation has already 
> happened before the fatal error: the failed job was released/recovered from 
> HA metadata.
> *Reproduction outline:*
> 1. Run a Kubernetes HA application cluster with restart-strategy.type: none.
> 2. Use a persistent HA storage dir.
> 3. Delete the TaskManager so the job reaches FAILED.
> 4. Immediately after observing RUNNING -> FAILING, patch the cluster leader 
> ConfigMap annotation holderIdentity to empty, forcing leadership 
> loss/reacquire.
> 5. Observe FAILED followed by SUSPENDED/release/recovery of the same JobID.
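
The expected behaviour described above can be sketched as a "first settled result is final" guard. This is a toy model under stated assumptions, not the actual fix: {{GuardedRunner}}, {{settle}}, and the {{"remove-plan"}} / {{"release-plan"}} action strings are hypothetical names for illustration. The {{globallyTerminal}} flag only mirrors the distinction the description relies on, where FAILED is globally terminal and SUSPENDED is not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the expected behaviour; none of these types are Flink classes. */
class TerminalResultGuardSketch {

    /** Simplified statuses; the flag marks globally terminal states. */
    enum Status {
        FAILED(true),
        SUSPENDED(false);

        final boolean globallyTerminal;

        Status(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }
    }

    /** Hypothetical runner in which the first recorded result is final. */
    static class GuardedRunner {
        private final AtomicReference<Status> result = new AtomicReference<>();
        final List<String> haActions = new ArrayList<>();

        void jobReachedTerminalState(Status status) {
            settle(status);
        }

        void onRevokeLeadership() {
            settle(Status.SUSPENDED);
        }

        // Only the first caller wins; a later revoke cannot mask a recorded FAILED.
        private void settle(Status status) {
            if (!result.compareAndSet(null, status)) {
                return;
            }
            // Globally terminal: drop the plan for good. Otherwise keep it recoverable.
            haActions.add(status.globallyTerminal ? "remove-plan" : "release-plan");
        }

        Status result() {
            return result.get();
        }
    }

    /** Job fails first, then leadership is revoked. */
    static GuardedRunner failedThenRevoked() {
        GuardedRunner runner = new GuardedRunner();
        runner.jobReachedTerminalState(Status.FAILED);
        runner.onRevokeLeadership();
        return runner;
    }

    /** Leadership is revoked while the job has not terminated. */
    static GuardedRunner revokedWhileRunning() {
        GuardedRunner runner = new GuardedRunner();
        runner.onRevokeLeadership();
        return runner;
    }

    public static void main(String[] args) {
        GuardedRunner a = failedThenRevoked();
        System.out.println(a.result() + " " + a.haActions);
        GuardedRunner b = revokedWhileRunning();
        System.out.println(b.result() + " " + b.haActions);
    }
}
```

In this model the first scenario keeps FAILED and removes the plan, while a revoke that arrives before any terminal state still yields SUSPENDED and a recoverable plan, which matches the normal failover case.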



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
