[
https://issues.apache.org/jira/browse/FLINK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308488#comment-17308488
]
Till Rohrmann commented on FLINK-21942:
---------------------------------------
I think the {{LeaderRetrievalService}} is closed when
{{ResourceManager.removeJob}} is called. Currently this happens only after a
job has had no leader for at least 5 minutes (configurable via
{{resourcemanager.job.timeout}}). With FLINK-21751 we added the job's state to
the {{disconnectJobManager}} call. We could use this to call {{removeJob}} as
soon as the job status is globally terminal. This could solve the problem.
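The idea above can be sketched as follows. This is a minimal, self-contained model, not actual Flink code: the names {{JobStatus}}, {{disconnectJobManager}}, and {{removeJob}} mirror the classes and methods discussed in the comment, but the class {{ResourceManagerSketch}} and its fields are hypothetical stand-ins.

```java
import java.util.HashSet;
import java.util.Set;

// Hedged sketch of the proposed fix: eagerly remove a job (and thereby close
// its leader retrieval resources) when disconnectJobManager reports a
// globally terminal job status, instead of waiting for the
// resourcemanager.job.timeout to expire.
public class ResourceManagerSketch {

    // Simplified stand-in for Flink's JobStatus enum.
    enum JobStatus {
        RUNNING(false), FINISHED(true), CANCELED(true), FAILED(true);

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    private final Set<String> registeredJobs = new HashSet<>();

    void registerJob(String jobId) {
        registeredJobs.add(jobId);
    }

    // With the job status available at disconnect time (FLINK-21751), the
    // resource manager can decide whether the job is gone for good.
    void disconnectJobManager(String jobId, JobStatus jobStatus) {
        if (jobStatus.isGloballyTerminalState()) {
            removeJob(jobId);
        }
        // Non-terminal disconnects keep the job registered so a new leader
        // can reconnect; the existing timeout path still covers that case.
    }

    // In the real ResourceManager, removeJob is where the
    // LeaderRetrievalService (and with it the KubernetesLeaderRetrievalDriver
    // and its ConfigMap watch) would be closed.
    void removeJob(String jobId) {
        registeredJobs.remove(jobId);
    }

    boolean isRegistered(String jobId) {
        return registeredJobs.contains(jobId);
    }
}
```

A job disconnected with a terminal status is removed immediately; a job disconnected while still RUNNING stays registered, matching the current timeout-based behavior.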
> KubernetesLeaderRetrievalDriver not closed after terminated which lead to
> connection leak
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-21942
> URL: https://issues.apache.org/jira/browse/FLINK-21942
> Project: Flink
> Issue Type: Bug
> Reporter: Yi Tang
> Priority: Major
> Attachments: image-2021-03-24-18-08-30-196.png,
> image-2021-03-24-18-08-42-116.png, jstack.l
>
>
> It looks like the KubernetesLeaderRetrievalDriver is not closed even after
> the KubernetesLeaderElectionDriver is closed and the job has reached a
> globally terminal state. As a result, many ConfigMap watches stay active,
> each holding a connection to K8s.
> When the number of connections exceeds the maximum concurrent requests, new
> ConfigMap watches cannot be started. Eventually, all newly submitted jobs
> time out.
> [~fly_in_gis] [~trohrmann] This may be related to FLINK-20695, could you
> confirm this issue?
> However, while many jobs are running in the same session cluster, their
> ConfigMap watches must stay active. Maybe we should merge all ConfigMap
> watches into one?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)