[
https://issues.apache.org/jira/browse/FLINK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307627#comment-17307627
]
Yang Wang commented on FLINK-21942:
-----------------------------------
I do not think we have a Kubernetes watch leak currently. When a Flink job
reaches a terminal state (e.g. canceled), the {{KubernetesLeaderRetrievalDriver}}
is stopped. I have verified this in both application mode and session mode.
When I cancel the job, I can find the following logs in the JobManager log.
{code:java}
2021-03-24 06:49:11,958 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2021-03-24 06:49:11,958 INFO  org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Stopping KubernetesLeaderRetrievalDriver{configMapName='standalone-k8s-ha-session-resourcemanager-leader'}.
{code}
Could you please double-check that?
Currently, in the Kubernetes HA service, each job has a dedicated leader
election and retrieval service. This is clearer and also aligned with the
ZooKeeper HA service. AFAIK, this does not place a burden on the APIServer,
which can be run with multiple replicas. Have you found that the APIServer
becomes a bottleneck because of Flink watches?
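For illustration, the close-on-stop contract discussed above can be sketched in plain Java. All names here ({{WatchHandle}}, {{LeaderRetrievalDriverSketch}}) are hypothetical stand-ins, not the actual Flink or fabric8 classes; the point is only that stopping the driver must also close its ConfigMap watch so the connection to the APIServer is released.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a Kubernetes watch handle; the real code holds
// an AutoCloseable watch on the leader ConfigMap.
interface WatchHandle {
    void close();
}

// Sketch of a retrieval driver whose stop path closes the watch. Closing is
// idempotent, so a double stop does not double-close the connection.
class LeaderRetrievalDriverSketch implements AutoCloseable {
    private final WatchHandle configMapWatch;
    private final AtomicBoolean running = new AtomicBoolean(true);

    LeaderRetrievalDriverSketch(WatchHandle configMapWatch) {
        this.configMapWatch = configMapWatch;
    }

    boolean isRunning() {
        return running.get();
    }

    @Override
    public void close() {
        // Only the first call transitions running -> stopped and closes
        // the watch; later calls are no-ops.
        if (running.compareAndSet(true, false)) {
            configMapWatch.close();
        }
    }
}
{code}
If the driver were stopped without this close, the watch connection would stay open, which is exactly the leak the issue below describes.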
> KubernetesLeaderRetrievalDriver not closed after terminated which lead to
> connection leak
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-21942
> URL: https://issues.apache.org/jira/browse/FLINK-21942
> Project: Flink
> Issue Type: Bug
> Reporter: Yi Tang
> Priority: Major
>
> Looks like KubernetesLeaderRetrievalDriver is not closed even after the
> KubernetesLeaderElectionDriver is closed and the job reaches a globally
> terminal state.
> This leaves many ConfigMap watches active, each holding a connection to
> K8s.
> When the number of connections exceeds the maximum concurrent requests, new
> ConfigMap watches cannot be started, and eventually all newly submitted
> jobs time out.
> [~fly_in_gis] [~trohrmann] This may be related to FLINK-20695; could you
> confirm this issue?
> However, when many jobs are running in the same session cluster, the
> ConfigMap watches need to stay active. Maybe we should merge all ConfigMap
> watches into one?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)