[jira] [Commented] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

Gyula Fora (Jira) Thu, 18 Jan 2024 22:42:03 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808485#comment-17808485
 ]


Gyula Fora commented on FLINK-34007:
------------------------------------

[~wangyang0918] the tests failed. The single-threaded executor service was 
previously used only to run Flink-side logic. Now we also have to pass it to 
the LeaderElector itself, and with only one thread that apparently ended in a 
deadlock.
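
For illustration only, here is a minimal sketch of one way a shared 
single-threaded executor can stall: a task occupies the only thread and then 
waits for a second task queued on the same executor. This is plain 
java.util.concurrent code, not the Flink or LeaderElector implementation, and 
the exact mechanism in the failing tests is not confirmed. The class and 
method names are hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of Flink or any Kubernetes client.
class SharedExecutorDeadlockDemo {

    // Returns true if a task that blocks on a second task submitted to the
    // same pool finishes within the timeout, false if it stalls.
    static boolean outerTaskCompletes(int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Future<String> outer = pool.submit(() -> {
                // Stand-in for a long-running loop that holds a pool thread
                // and needs a callback executed on the same pool.
                Future<String> inner = pool.submit(() -> "callback ran");
                return inner.get(); // blocks until the callback has run
            });
            outer.get(500, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // With one thread, the callback never leaves the queue.
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // interrupt the stuck task so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("1 thread:  " + outerTaskCompletes(1));
        System.out.println("3 threads: " + outerTaskCompletes(3));
    }
}
```

With a pool size of 1 the outer task should time out, while a larger pool 
lets the callback run on another thread.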

So I increased the thread count to 3, which made it work. Yesterday I started 
to think that this may actually be a reason why we see ConfigMap version 
conflicts (and lost leadership) in the first place.

This is probably unrelated to why it cannot recover the leadership, but I am 
going to try changing it back to 1 or using 2 separate single-threaded 
executors.
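
As a rough sketch of the second option, assuming the work can be split into 
an election loop and its callbacks, two dedicated single-threaded executors 
keep each side serialized without one blocking the other. The names below 
(electionExecutor, callbackExecutor, runWithDedicatedExecutors) are 
hypothetical and do not come from the Flink code base.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the actual Flink HA services wiring.
class SeparateExecutorsSketch {

    static String runWithDedicatedExecutors() {
        // One thread for the election loop, one for the callbacks it triggers.
        ExecutorService electionExecutor = Executors.newSingleThreadExecutor();
        ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();
        try {
            Future<String> election = electionExecutor.submit(() -> {
                // The callback runs on its own thread, so waiting here
                // cannot starve it.
                Future<String> callback =
                        callbackExecutor.submit(() -> "leadership granted");
                return callback.get();
            });
            return election.get(5, TimeUnit.SECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            electionExecutor.shutdownNow();
            callbackExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithDedicatedExecutors());
    }
}
```

Each executor still processes its own tasks one at a time, which keeps the 
ordering guarantees of a single thread on each side.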

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Blocker
>              Labels: pull-request-available
>         Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a 
> failed container unable to register itself with the ResourceManager after 
> the timeout. See the attached JM log.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
