[jira] [Comment Edited] (FLINK-21942) KubernetesLeaderRetrievalDriver not closed after terminated which lead to connection leak

Yi Tang (Jira) Wed, 24 Mar 2021 01:08:10 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307635#comment-17307635
 ]


Yi Tang edited comment on FLINK-21942 at 3/24/21, 8:07 AM:
-----------------------------------------------------------

Did you find any log like 
{code}
Stopping KubernetesLeaderRetrievalDriver{configMapName='xxxx-{jobid}-jobmanager-leader'}.
{code}
for jobs other than the ones that timed out?

Furthermore, if many jobs are running (more than the max concurrent requests, 
64, which does not appear to be configurable), the API server itself is OK. The 
problem is that the new watches for a new job (leader election and retrieval) 
cannot be established.
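
To make the failure mode concrete, here is a toy model in plain Java. It is only a sketch and not Flink or Kubernetes client code: the class and method names are made up for illustration, a semaphore stands in for the client's request limit, and the value 64 is simply the limit mentioned above.

```java
import java.util.concurrent.Semaphore;

public class WatchLeakSketch {
    // Assumed limit, taken from the "max concurrent requests 64" mentioned above.
    static final int MAX_CONCURRENT_REQUESTS = 64;

    /** Hypothetical stand-in for a long-lived watch that holds one request slot until closed. */
    static final class Watch implements AutoCloseable {
        private final Semaphore slots;
        private boolean closed;

        Watch(Semaphore slots) {
            this.slots = slots;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                slots.release(); // give the request slot back
            }
        }
    }

    /** Tries to open a watch; returns null when every slot is already taken. */
    static Watch tryOpenWatch(Semaphore slots) {
        return slots.tryAcquire() ? new Watch(slots) : null;
    }

    /** Runs the given number of jobs in sequence and returns how many watches could be opened. */
    static int runJobs(int jobs, boolean closeOnTermination) {
        Semaphore slots = new Semaphore(MAX_CONCURRENT_REQUESTS);
        int established = 0;
        for (int i = 0; i < jobs; i++) {
            Watch watch = tryOpenWatch(slots);
            if (watch == null) {
                continue; // a real submission would wait here and eventually time out
            }
            established++;
            if (closeOnTermination) {
                watch.close(); // job reached a terminal state: release the slot
            }
        }
        return established;
    }

    public static void main(String[] args) {
        System.out.println("watches never closed: " + runJobs(100, false));
        System.out.println("watches closed on termination: " + runJobs(100, true));
    }
}
```

In this model, 100 sequential jobs whose watches are never closed get only 64 watches established, and every later job is starved. When each watch is closed once its job terminates, all 100 succeed.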

I'll try to provide detailed steps to reproduce it.



was (Author: yittg):
Did you find any log like 
{code}
Stopping 
KubernetesLeaderRetrievalDriver{configMapName='xxxx-{jobid}-jobmanager-leader'}.
{/code}
excepts for those timeout jobs.

Furthermore, if many jobs are running (exceeds the max concurrent requests 64 
which looks like can not be configured), the API server is OK. The problem is 
that the new watch for new job(election and retrieve) can not be established.

I'll try to provide a detailed steps to reproduce it.


> KubernetesLeaderRetrievalDriver not closed after terminated which lead to 
> connection leak
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21942
>                 URL: https://issues.apache.org/jira/browse/FLINK-21942
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yi Tang
>            Priority: Major
>
> It looks like KubernetesLeaderRetrievalDriver is not closed even when the 
> KubernetesLeaderElectionDriver is closed and the job reaches a globally 
> terminated state.
> As a result, many ConfigMap watches stay active and keep their connections to 
> K8s open.
> When the number of connections exceeds the max concurrent requests, new 
> ConfigMap watches cannot be started. Eventually all newly submitted jobs 
> time out.
> [~fly_in_gis] [~trohrmann] This may be related to FLINK-20695, could you 
> confirm this issue?
> But when many jobs are running in the same session cluster, their ConfigMap 
> watches are required to be active. Maybe we should merge all the ConfigMap 
> watches?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
