[
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805960#comment-17805960
]
Matthias Pohl commented on FLINK-34007:
---------------------------------------
I couldn't find anything related to ConfigMap cleanup being triggered during
leadership loss in the Flink code (Flink's
[KubernetesLeaderElector:82|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L82]
sets up the [callback for the leadership
loss|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L124]
which is implemented by
[KubernetesLeaderElectionDriver#LeaderCallbackHandlerImpl|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L212]).
This behavior is on par with the old (i.e., pre-FLINK-24038) version of the
Flink 1.15 codebase (see the 1.15 class
[KubernetesLeaderElectionDriver:202|https://github.com/apache/flink/blob/bba7c417217be878fffb12efedeac50dec2a7459/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L202]).
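For illustration, here's a minimal sketch of that wiring against the fabric8 client API (not the actual Flink code; the lock namespace/name/identity and the timeouts are placeholders I made up). The point is that the leadership-loss callback is just a {{Runnable}} handed to the LeaderElector; there is no ConfigMap cleanup hook in this path:
{code:java}
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock;

import java.time.Duration;

public class LeaderElectionCallbackSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.leaderElector()
                .withConfig(new LeaderElectionConfigBuilder()
                    .withName("callback-sketch")
                    // placeholder namespace / ConfigMap name / identity
                    .withLock(new ConfigMapLock("default", "ha-configmap", "jobmanager-0"))
                    .withLeaseDuration(Duration.ofSeconds(15))
                    .withRenewDeadline(Duration.ofSeconds(10))
                    .withRetryPeriod(Duration.ofSeconds(2))
                    .withLeaderCallbacks(new LeaderCallbacks(
                        () -> System.out.println("leadership granted"),
                        // leadership-loss callback: a plain Runnable;
                        // no ConfigMap cleanup happens here
                        () -> System.out.println("leadership revoked"),
                        newLeader -> System.out.println("new leader: " + newLeader)))
                    .build())
                .build()
                .run(); // blocks until leadership is lost
        }
    }
}
{code}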
That's also not something I would expect Flink to do. It should be handled by
the LeaderElector instead, because the LeaderElector knows the state of the
leader election and can trigger a cleanup before the new leader information is
written to the ConfigMap entry. Flink shouldn't trigger a cleanup itself
because it doesn't know whether a new leader was already elected (in which case
cleaning up the ConfigMap entry would erase the new leader's leadership
information). And the Flink process wouldn't be able to clean it up anyway,
because the process isn't the leader anymore. Or am I missing something here?
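To make the race concrete, here is a hypothetical cleanup a revoked leader could attempt ({{cleanUpIfStillOwner}} and all names are made up for illustration; I'm not proposing this). Any get-then-delete from outside the LeaderElector is inherently racy, because a new leader can write its record in between:
{code:java}
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;

final class RevokedLeaderCleanupSketch {
    // fabric8's ConfigMapLock stores the leader election record as JSON
    // under this annotation on the lock ConfigMap.
    private static final String LEADER_ANNOTATION =
        "control-plane.alpha.kubernetes.io/leader";

    static void cleanUpIfStillOwner(
            KubernetesClient client, String ns, String lockName, String ownIdentity) {
        ConfigMap cm = client.configMaps().inNamespace(ns).withName(lockName).get();
        String record =
            (cm == null || cm.getMetadata().getAnnotations() == null)
                ? null
                : cm.getMetadata().getAnnotations().get(LEADER_ANNOTATION);
        if (record != null && record.contains(ownIdentity)) {
            // Race: a new leader can write its own record between the get()
            // above and this delete(), so the delete would erase the new
            // leader's information. Only the LeaderElector, which owns the
            // election state, can sequence a cleanup before the new leader
            // record is written.
            client.configMaps().inNamespace(ns).withName(lockName).delete();
        }
    }
}
{code}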
On another note: I came across [this
change|https://github.com/apache/flink/pull/22540/files#diff-0e859df42954459619211d2ec60957742b24c9fc6ce55526616fddc540f0f8ffL59-R60]
in the FLINK-31997 PR (k8s client update to 6.6.2): We're changing the thread
pool size from 1 to 3, essentially allowing the same internal LeaderElector to
be executed multiple times concurrently (because we trigger another
{{KubernetesLeaderElector#run}} call when the leadership is revoked). The old
version of the code used a single thread, which meant that a new run call would
be queued until the previous {{LeaderElector#run}} call failed for whatever
reason. That change sounds strange, but it shouldn't be the cause of this Jira
issue because the change only went into 1.18 and we're experiencing this in
older versions of Flink as well.
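A toy comparison with plain JDK executors (not the Flink code) shows the behavioral difference between the two pool sizes:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorSizeSketch {
    public static void main(String[] args) {
        // Pre-FLINK-31997: size 1, so a re-run submitted from the revocation
        // callback is queued until the current run() has returned.
        ExecutorService single = Executors.newFixedThreadPool(1);
        // Post-FLINK-31997: size 3, so the re-run can start while the previous
        // run() is still executing, i.e. the same elector logic can be in
        // flight more than once.
        ExecutorService pool = Executors.newFixedThreadPool(3);

        Runnable run = () -> {
            System.out.println(Thread.currentThread().getName() + ": run() started");
            try {
                Thread.sleep(1000); // stand-in for a long-running election loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println(Thread.currentThread().getName() + ": run() finished");
        };

        single.submit(run);
        single.submit(run); // waits until the first run finishes
        pool.submit(run);
        pool.submit(run);   // runs concurrently with the previous submit

        single.shutdown();
        pool.shutdown();
    }
}
{code}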
> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that the JobManager goes into a suspended state, with a
> failed container unable to register itself with the ResourceManager after
> the timeout.
> JM log: see attached
>