[
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806410#comment-17806410
]
Zhenqiu Huang commented on FLINK-34007:
---------------------------------------
[~mapohl]
Yes, from what I observed in the failure case, the ConfigMap was not cleaned up
when the job manager lost leadership. The renewTime field is also no longer
updated by the leader elector, which means the leader elector has already
exited its run loop. Looking at the fabric8 leader elector source code, it
appears that LeaderElector aborts its run loop only when the renew deadline
expires. Even though I don't know why the renew deadline expired, increasing
the high-availability.kubernetes.leader-election.renew-deadline value could
mitigate some transient issues.
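To illustrate why a larger renew deadline could absorb a short outage, here is
a minimal sketch of a renew loop that gives up once the deadline has passed.
It is a simplified model and not the fabric8 implementation: the class name,
the method name, and the 15 s / 30 s / 5 s values are assumptions chosen only
for illustration.

```java
public class RenewDeadlineSketch {

    /**
     * Simplified model: renewal is retried every retryPeriodMs until one
     * attempt succeeds or renewDeadlineMs of simulated time has elapsed.
     * The first failedAttempts attempts fail (for example, a transient
     * API server outage); every later attempt succeeds.
     *
     * @return true if the lease is renewed before the deadline,
     *         false if the loop aborts (leadership would be lost)
     */
    static boolean survivesOutage(int failedAttempts, long renewDeadlineMs, long retryPeriodMs) {
        int attempt = 0;
        for (long elapsedMs = 0; elapsedMs < renewDeadlineMs; elapsedMs += retryPeriodMs) {
            boolean renewed = attempt >= failedAttempts;
            if (renewed) {
                return true;
            }
            attempt++;
        }
        // Deadline expired without a successful renewal: the loop exits.
        return false;
    }

    public static void main(String[] args) {
        // Three consecutive failures with a 5 s retry period.
        System.out.println(survivesOutage(3, 15_000, 5_000)); // false
        System.out.println(survivesOutage(3, 30_000, 5_000)); // true
    }
}
```

In this model the same three failed attempts abort the loop with a 15 s
deadline but are tolerated with a 30 s deadline, which is the effect I would
expect from raising the renew-deadline setting.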
Two days ago I started a test job with debug logging enabled for both
io.fabric8.kubernetes.client.extended.leaderelection and the Flink Kubernetes
leader election modules. If the job fails, I will post the new logs in this
thread.
[~wangyang0918]
Would you please elaborate a little on why "It seems that the fabric8
Kubernetes client leader elector will not work properly by run() more than once
if we do not clean up the leader annotation."?
> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that the job manager goes into a suspended state, with a
> failed container unable to register itself with the resource manager after
> the timeout. See the attached JM log.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)