[ https://issues.apache.org/jira/browse/FLINK-32010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Morávek updated FLINK-32010:
----------------------------------
Component/s: Deployment / Kubernetes
             Runtime / Coordination
> KubernetesLeaderRetrievalDriver always waits for lease update to resolve
> leadership
> -----------------------------------------------------------------------------------
>
> Key: FLINK-32010
> URL: https://issues.apache.org/jira/browse/FLINK-32010
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Coordination
> Affects Versions: 1.17.0, 1.16.1, 1.18.0
> Reporter: David Morávek
> Assignee: David Morávek
> Priority: Major
>
> The k8s-based leader retrieval is based on ConfigMap watching. The ConfigMap
> lifecycle (from the consumer's point of view) is handled as a series of events
> with the following types (a sketch of the watch loop follows the list):
> * ADDED -> the first time the consumer has seen the CM
> * UPDATED -> any further changes to the CM
> * DELETED -> the CM has been removed
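> A minimal sketch of such a watch loop, written directly against the fabric8
> Kubernetes client (the class, namespace, CM name, and data key below are
> illustrative and not Flink's actual internals), with the current behavior of
> ignoring ADDED events:
> {code:java}
> import io.fabric8.kubernetes.api.model.ConfigMap;
> import io.fabric8.kubernetes.client.KubernetesClient;
> import io.fabric8.kubernetes.client.Watcher;
> import io.fabric8.kubernetes.client.WatcherException;
>
> /** Illustrative leader retrieval watcher, mirroring the event types above. */
> class LeaderConfigMapWatcher implements Watcher<ConfigMap> {
>
>     static void startWatching(KubernetesClient client) {
>         // Namespace and ConfigMap name are made up for the example.
>         client.configMaps()
>               .inNamespace("default")
>               .withName("example-cluster-leader")
>               .watch(new LeaderConfigMapWatcher());
>     }
>
>     @Override
>     public void eventReceived(Action action, ConfigMap cm) {
>         switch (action) {
>             case ADDED:
>                 // Assumed to be the freshly created, still-empty CM,
>                 // so no leadership information is extracted here.
>                 break;
>             case MODIFIED: // fabric8's name for the "UPDATED" event above
>                 // Leadership is only resolved from subsequent updates.
>                 String address =
>                         cm.getData() == null ? null : cm.getData().get("address");
>                 System.out.println("Leader address: " + address);
>                 break;
>             case DELETED:
>                 System.out.println("Leader CM removed, leadership revoked");
>                 break;
>             default:
>                 break;
>         }
>     }
>
>     @Override
>     public void onClose(WatcherException cause) {
>         // Watch terminated (e.g. connection loss); a real driver would re-watch.
>     }
> }
> {code}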
> The implementation assumes that the ElectionDriver (the one that creates the CM)
> and the ElectionRetriever are started simultaneously and therefore ignores the
> ADDED events, because the CM is always created empty and is only updated with
> the leadership information later on.
> This assumption is incorrect in the following cases (there may be more, but
> these are enough to illustrate the problem):
> * A TM that joins the cluster after the leaders have already been established
> and needs to discover the RM / JM
> * The RM trying to discover the JM when
> MultipleComponentLeaderElectionDriver is used
> This leads, for example, to higher job submission latencies: the submission can
> be unnecessarily held back for up to the lease retry period [1].
> [1] Configured by _high-availability.kubernetes.leader-election.retry-period_
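> For illustration, a sketch of the direction a fix could take (using the same
> illustrative watcher as above; this is not the actual patch): treat ADDED like
> an update and extract whatever leadership data is already present in the CM,
> so a consumer that starts watching late resolves the leader immediately
> instead of waiting for the next lease update.
> {code:java}
> @Override
> public void eventReceived(Action action, ConfigMap cm) {
>     switch (action) {
>         case ADDED:    // no longer assumed to be empty
>         case MODIFIED:
>             notifyIfLeaderPresent(cm);
>             break;
>         case DELETED:
>             System.out.println("Leader CM removed, leadership revoked");
>             break;
>         default:
>             break;
>     }
> }
>
> // Hypothetical helper: only notify listeners once the CM carries leader data.
> private void notifyIfLeaderPresent(ConfigMap cm) {
>     String address = cm.getData() == null ? null : cm.getData().get("address");
>     if (address != null) {
>         System.out.println("Leader address: " + address);
>     }
> }
> {code}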