[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805294#comment-17805294
 ] 

Zhenqiu Huang commented on FLINK-34007:
---------------------------------------

[~mapohl]
Yes, I mistakenly looked into the Flink 1.17 source code. I uploaded another 
debug log above. The KubernetesLeaderElector checks the annotation 
"control-plane.alpha.kubernetes.io/leader" and whether the lockIdentity exists 
in its content. Given that this job has only one JobManager, there should be no 
other JobManager instance trying to acquire the lock. The only possibility is 
that the cluster ConfigMap is somehow returned incorrectly.
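
To make the check concrete, here is a minimal sketch of the logic as I 
understand it. It is not copied from the Flink source: the class name, the 
plain Map standing in for the ConfigMap annotations, and the "jm-1" identity 
are all assumptions for illustration.

```java
import java.util.Map;

public class LeaderAnnotationCheck {
    // Annotation key quoted in the comment above.
    static final String LEADER_ANNOTATION_KEY = "control-plane.alpha.kubernetes.io/leader";

    // Sketch only: leadership is assumed to hold when the annotation is present
    // and its content mentions our lock identity.
    static boolean hasLeadership(Map<String, String> annotations, String lockIdentity) {
        String leader = annotations.get(LEADER_ANNOTATION_KEY);
        return leader != null && leader.contains(lockIdentity);
    }

    public static void main(String[] args) {
        String lockIdentity = "jm-1"; // hypothetical identity

        // Annotation content is an illustrative record, not a real cluster dump.
        Map<String, String> held =
                Map.of(LEADER_ANNOTATION_KEY, "{\"holderIdentity\":\"jm-1\"}");
        // A ConfigMap returned without the annotation looks like lost leadership.
        Map<String, String> missing = Map.of();

        System.out.println(hasLeadership(held, lockIdentity));    // true
        System.out.println(hasLeadership(missing, lockIdentity)); // false
    }
}
```

Under this reading, a ConfigMap that comes back without the annotation, or 
with stale content, is indistinguishable from another instance holding the 
lock, which would explain a loss of leadership with only one JobManager.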

In this case, even if the fabric8 LeaderElector continues trying to acquire 
leadership (provided it can get it without exceeding the deadline), Flink will 
not be able to restart its services (such as the RM and dispatcher) because the 
DefaultLeaderRetrievalService is also stopped. To resolve the issue for now, 
should we focus on gracefully shutting down the JobManager rather than moving 
the job to the Suspended status?


> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a 
> failed container unable to register itself with the ResourceManager after a 
> timeout. JM log: see attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
