[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806782#comment-17806782
 ] 

Matthias Pohl commented on FLINK-34007:
---------------------------------------

{quote}
At least based on the reports of this Jira issue, there must have been an 
incident (which caused the lease to not be renewed) in a k8s cluster that 
triggered the same failure in multiple Flink clusters (with versions of 1.18, 
1.17 and 1.16 at least) that triggered the same issue in all of those 
deployments. ...if I understand it correctly.
{quote}
[~ZhenqiuHuang] can you elaborate a bit on the incident itself, just to get 
a bit more context? Did I understand correctly that different Flink versions 
were deployed in a single Kubernetes cluster, running independently, and that 
all of them ran into the same issue around the same time (indicating that it 
was caused by the same event)? Or did the failures in the different Flink 
clusters happen independently of each other over a longer stretch of time?

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into suspend state, with a 
> failed container unable to register itself with the ResourceManager after 
> a timeout. For the JM log, see the attachments.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
