[
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806782#comment-17806782
]
Matthias Pohl commented on FLINK-34007:
---------------------------------------
{quote}
At least based on the reports of this Jira issue, there must have been an
incident (which caused the lease to not be renewed) in a k8s cluster that
triggered the same failure in multiple Flink clusters (with versions of 1.18,
1.17 and 1.16 at least) that triggered the same issue in all of those
deployments. ...if I understand it correctly.
{quote}
[~ZhenqiuHuang] can you elaborate a bit on the incident itself, just to get
a bit more context? Did I understand correctly that different Flink versions
were deployed in a single Kubernetes cluster, running independently, and that
all of them ran into the same issue around the same time (indicating that the
failures were caused by the same event)? Or did the failures in the different
Flink clusters happen independently of each other over a longer stretch of
time?
> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a
> failed container unable to register itself with the ResourceManager after a
> timeout. See the attached JM log.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)