[
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806759#comment-17806759
]
Matthias Pohl commented on FLINK-34007:
---------------------------------------
But the issue should be present in several k8s client versions
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|
At least based on the reports in this Jira issue, there must have been an
incident in a k8s cluster that triggered the same failure in multiple Flink
clusters (running at least versions 1.18, 1.17 and 1.16), if I understand it
correctly. Therefore, the issue should exist in all k8s client versions in
the range [5.12.3, 6.6.2].
---
On another note: I remembered that there is a slight difference in the
revocation protocol in the
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
changes:
* The old implementation (see [1.15
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
did try to clear the leader information from the ConfigMap.
* The new implementation (see [1.18+
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
no longer clears the component leader information. The reasoning was that the
data could no longer be updated anyway because the leadership is already lost.
But that change still seems to be reasonable based on my findings: In the k8s
client 6.6.2 codebase, {{stopLeading}} is either called after noticing the
change in the lock identity
([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238];
the lock identity change would prevent the clearing of the data) or when the
lease wasn't renewed
([LeaderElector:L95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95])
where we would have to assume that other leader information is already
written. And this change shouldn't be related to the issues with the lock
lifecycle in general because it only affects metadata and not the lock
annotation itself, should it? WDYT?
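
To make the comparison concrete, here is a minimal sketch of the two revocation behaviors. It is a toy model, not Flink or fabric8 code: the class {{RevocationSketch}}, the keys {{lock-identity}} and {{leader-info}}, and both method names are hypothetical, and the ConfigMap is stood in for by a plain map. It assumes, as described above, that clearing the leader information only succeeds while the revoking instance still owns the lock.

```java
import java.util.HashMap;
import java.util.Map;

public class RevocationSketch {
    // Hypothetical keys; the real ConfigMap uses different annotation and data keys.
    static final String LOCK_KEY = "lock-identity";
    static final String LEADER_INFO_KEY = "leader-info";

    /** Builds a toy stand-in for the HA ConfigMap. */
    static Map<String, String> configMap(String lockHolder, String leaderInfo) {
        Map<String, String> cm = new HashMap<>();
        cm.put(LOCK_KEY, lockHolder);
        cm.put(LEADER_INFO_KEY, leaderInfo);
        return cm;
    }

    /**
     * Old-style revocation (assumption): try to clear the leader information.
     * The attempt is rejected if another identity already holds the lock.
     * Returns true only if the leader information was actually cleared.
     */
    static boolean revokeOldStyle(Map<String, String> cm, String ownIdentity) {
        if (!ownIdentity.equals(cm.get(LOCK_KEY))) {
            return false; // lock identity changed, so the data cannot be cleared
        }
        cm.remove(LEADER_INFO_KEY);
        return true;
    }

    /** New-style revocation (assumption): leave the ConfigMap untouched. */
    static boolean revokeNewStyle(Map<String, String> cm, String ownIdentity) {
        return false;
    }

    public static void main(String[] args) {
        // jm-1 lost leadership after jm-2 took over the lock.
        Map<String, String> taken = configMap("jm-2", "address-of-jm-2");
        System.out.println("old style cleared: " + revokeOldStyle(taken, "jm-1"));
        System.out.println("new style cleared: " + revokeNewStyle(taken, "jm-1"));
        System.out.println("leader info: " + taken.get(LEADER_INFO_KEY));
    }
}
```

In this model both variants leave the new leader's information in place once the lock identity has changed, which is the case where the two protocols should behave the same. They only differ when the revoking instance still appears as the lock holder.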
> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a
> failed container unable to register itself with the ResourceManager after the timeout.
> JM Log, see attached
>