[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

Zhenqiu Huang (Jira) Sat, 13 Jan 2024 16:24:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806410#comment-17806410
 ]


Zhenqiu Huang commented on FLINK-34007:
---------------------------------------

[~mapohl]
Yes. In the failure case I observed, the ConfigMap was not cleaned up when the
job manager lost leadership. The renewTime field is also no longer updated by
the leader elector, which means the leader elector has already exited its run
loop. Looking at the fabric8 leader elector source code, it appears that the
LeaderElector only aborts its run loop when the renew deadline expires. Even
though I don't know why the renew deadline expired, increasing the
high-availability.kubernetes.leader-election.renew-deadline value could
mitigate some transient issues.
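
To make the reasoning above concrete, here is a rough, simplified model of a
renew loop. It is not the fabric8 implementation: the class name, method name,
timings and the simulated outage are all hypothetical. It only illustrates the
behavior as I understand it, where the loop is abandoned once no renewal has
succeeded within the renew deadline.

```java
import java.util.function.LongPredicate;

public class RenewLoopSketch {

    /**
     * Simplified model of a leader-election renew loop (NOT fabric8 code).
     * Every retryPeriodMs the leader tries to renew. If no renewal has
     * succeeded for renewDeadlineMs, the loop is abandoned.
     *
     * @return the simulated time (ms) at which the loop gives up,
     *         or -1 if it is still running after totalMs
     */
    public static long runRenewLoop(long renewDeadlineMs, long retryPeriodMs,
                                    long totalMs, LongPredicate renewSucceedsAt) {
        long lastRenew = 0;
        for (long now = retryPeriodMs; now <= totalMs; now += retryPeriodMs) {
            if (renewSucceedsAt.test(now)) {
                lastRenew = now;            // renewal succeeded, deadline resets
            } else if (now - lastRenew >= renewDeadlineMs) {
                return now;                 // deadline expired, leave the loop
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical transient issue: renewals fail from t=10s to t=30s.
        LongPredicate renewOk = t -> t < 10_000 || t > 30_000;

        // 15 s deadline: the loop is abandoned during the outage.
        System.out.println(runRenewLoop(15_000, 5_000, 60_000, renewOk)); // 20000

        // 60 s deadline: the same outage is tolerated.
        System.out.println(runRenewLoop(60_000, 5_000, 60_000, renewOk)); // -1
    }
}
```

In this model a larger renew deadline only hides the outage; it does not
explain why the renewals failed in the first place.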

Two days ago I started a test job with debug logging enabled for both
io.fabric8.kubernetes.client.extended.leaderelection and the Flink Kubernetes
leader election modules. If the job fails, I will post the new logs in
this thread.

[~wangyang0918]
Would you please elaborate a little on why "It seems that the fabric8 
Kubernetes client leader elector will not work properly by run() more than once 
if we do not clean up the leader annotation."?



> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, job-manager.log
>
>
> The observation is that the job manager goes into a suspended state, with a
> failed container that is unable to register itself with the resource manager
> after the timeout.
> See the attached JM log.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
