[
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805157#comment-17805157
]
Matthias Pohl commented on FLINK-34007:
---------------------------------------
Ok, I went through the log file you shared. AFAIS, suspending the JobManager
worked as expected:
* The Job with the ID {{217cee964b2cfdc3115fb74cac0ec550}} was suspended due to
the leadership loss for session ID {{9987190b-35f4-4238-b317-057dc3615e4d}}.
* The ResourceManager and the Dispatcher got their leadership revoked as well.
* The ResourceManager is not shut down.
* The Dispatcher is stopped but the corresponding DispatcherLeaderProcess keeps
running. That's the process that should trigger another Dispatcher
initialization if it picks up leadership again.

The {{RecipientUnreachableException}} presumably appears because no leader is
re-elected. Does this match your findings?

As far as I understand, you don't have any other standby JM running in the
Flink cluster? In that case, we would expect this very same JobManager to pick
up leadership again. Do you have any logs from the Kubernetes cluster that we
could investigate?

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
> Attachments: job-manager.log
>
>
> The observation is that the JobManager goes into a suspended state, with a
> failed container unable to register itself with the ResourceManager after a
> timeout. See the attached JM log.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)