[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806759#comment-17806759 ]

Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:41 AM:
-----------------------------------------------------------------

But it should be an issue that is present in different k8s client versions 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports in this Jira issue, there must have been an 
incident (which caused the lease to not be renewed) in a k8s cluster that 
triggered the same failure in multiple Flink clusters (running at least 
versions 1.16, 1.17 and 1.18) ...if I understand it correctly.

Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2].

—

On another note: I remembered that there is a slight difference in the 
revocation protocol introduced with the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238]) still tried to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484]) doesn't clear the component leader information anymore. The reasoning was that the data couldn't be updated anyway because the leadership is already lost. (A minimal sketch of the difference follows below.)
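
To make the difference a bit more concrete, here is a minimal, self-contained Java sketch of the two revocation paths. All type and method names below are made up for illustration; this is not the actual Flink code.

{code:java}
// Stand-in for the ConfigMap-backed leader store (hypothetical interface).
interface LeaderStore {
    void writeLeaderInformation(String leaderInfo);
}

class RevocationSketch {
    private final LeaderStore store;
    private String confirmedLeaderInformation;

    RevocationSketch(LeaderStore store) {
        this.store = store;
    }

    // 1.15-style revocation: clear the locally confirmed leader information AND
    // try (best effort) to wipe the externally stored leader information.
    void revokeLeadershipOldStyle() {
        confirmedLeaderInformation = null;
        store.writeLeaderInformation("");   // clear the ConfigMap entry
    }

    // 1.18+-style revocation: only clear the local state; the external write is
    // skipped because the leadership (and with it the permission to write) is
    // assumed to be lost already.
    void revokeLeadershipNewStyle() {
        confirmedLeaderInformation = null;
        // intentionally no store.writeLeaderInformation(...) call anymore
    }
}
{code}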

But that change still seems reasonable based on my findings: In the k8s 
client 6.6.2 codebase, {{stopLeading}} is called either after a change in the 
lock identity is noticed 
([LeaderElector:238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238];
 the changed lock identity would prevent the clearing of the data) or when the 
lease wasn't renewed 
([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95]),
 where we would have to assume that other leader information has already been 
written. And this change shouldn't be related to the issues with the lock 
lifecycle in general, because it only affects metadata and not the lock 
annotation itself, should it? WDYT?
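
For reference, a simplified paraphrase of the two call paths that end in {{stopLeading}}; this is not the actual fabric8 source, and all names are made up for illustration:

{code:java}
class LeaderLoopSketch {
    private final String localIdentity;
    private String observedLockIdentity;   // identity currently stored in the lock record

    LeaderLoopSketch(String localIdentity) {
        this.localIdentity = localIdentity;
        this.observedLockIdentity = localIdentity;
    }

    // Call path 1 (cf. LeaderElector:238 linked above): the lock record was re-read
    // and now carries a different identity, i.e. another instance took over.
    void onLockRecordUpdated(String newLockIdentity) {
        this.observedLockIdentity = newLockIdentity;
        if (!localIdentity.equals(newLockIdentity)) {
            stopLeading();   // lock identity changed -> clearing the data would be prevented anyway
        }
    }

    // Call path 2 (cf. LeaderElector:95 linked above): the renew loop gave up because
    // the lease could not be renewed before the deadline.
    void onRenewDeadlineExceeded() {
        stopLeading();       // other leader information may already have been written
    }

    private void stopLeading() {
        // Flink's leader-election driver reacts to this by revoking leadership,
        // which is where the (now removed) ConfigMap cleanup used to happen.
    }
}
{code}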



> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into suspended state after a 
> failed container is not able to register itself with the ResourceManager 
> before the timeout.
> JM Log: see attached
>  


