[jira] [Commented] (FLINK-21942) KubernetesLeaderRetrievalDriver not closed after terminated which lead to connection leak

Yi Tang (Jira) Wed, 24 Mar 2021 03:15:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307734#comment-17307734
 ]


Yi Tang commented on FLINK-21942:
---------------------------------

Kubernetes session cluster with options:
{code}
cluster.io-pool.size=16
kubernetes.jobmanager.cpu=2
{code}
h2. The simplest scenario

Run batch jobs one by one, e.g. examples/batch/WordCount.jar

!image-2021-03-24-18-08-30-196.png|width=990,height=629!

The latest job is created and waiting to be assigned; the previous one failed 
after waiting 5 minutes.
h2. Another scenario

With resource limits set as follows:
{code}
kubernetes.taskmanager.cpu=2
taskmanager.numberOfTaskSlots=10
{code}

Run a streaming job: examples/streaming/StateMachineExample.jar --error-rate 0.5 
--sleep 100

!image-2021-03-24-18-08-42-116.png|width=990,height=629!

The latest job stays in the initializing state, while the previous one is 
created and waiting to be assigned.
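
To illustrate why both scenarios stall, here is a minimal sketch of the suspected mechanism from the issue description. It is not the actual Flink or fabric8 code. It assumes that each ConfigMap watch holds one connection out of a fixed budget until it is closed; the names WatchLeakSketch, Watch and runJobs, and the budget of 4, are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.Semaphore;

public class WatchLeakSketch {

    /** Hypothetical stand-in for a ConfigMap watch: holds one connection until closed. */
    static final class Watch implements AutoCloseable {
        private final Semaphore connections;
        private boolean closed;

        Watch(Semaphore connections) {
            this.connections = connections;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                connections.release(); // give the connection back to the budget
            }
        }
    }

    /** Opens a watch if a connection is free, otherwise returns null. */
    static Watch tryOpenWatch(Semaphore connections) {
        return connections.tryAcquire() ? new Watch(connections) : null;
    }

    /**
     * Submits jobs one by one against a shared connection budget and
     * returns how many of them were able to start their watch.
     */
    public static int runJobs(int maxConnections, int jobs, boolean closeOnTermination) {
        Semaphore connections = new Semaphore(maxConnections);
        int started = 0;
        for (int i = 0; i < jobs; i++) {
            Watch watch = tryOpenWatch(connections);
            if (watch == null) {
                // No free connection: in the real cluster this job would keep
                // waiting for leader information and eventually time out.
                continue;
            }
            started++;
            // The job runs and reaches a globally terminal state here.
            if (closeOnTermination) {
                watch.close();
            }
            // Otherwise the watch leaks and keeps its connection.
        }
        return started;
    }

    public static void main(String[] args) {
        System.out.println("leaky:  " + runJobs(4, 10, false));
        System.out.println("closed: " + runJobs(4, 10, true));
    }
}
```

With a budget of 4 and 10 sequential jobs, the leaky variant starts only 4 watches, while closing the watch at termination lets all 10 start.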

> KubernetesLeaderRetrievalDriver not closed after terminated which lead to 
> connection leak
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21942
>                 URL: https://issues.apache.org/jira/browse/FLINK-21942
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yi Tang
>            Priority: Major
>         Attachments: image-2021-03-24-18-07-49-069.png, 
> image-2021-03-24-18-08-30-196.png, image-2021-03-24-18-08-42-116.png
>
>
> It looks like KubernetesLeaderRetrievalDriver is not closed even when the 
> KubernetesLeaderElectionDriver is closed and the job reaches a globally 
> terminal state.
> This leaves many ConfigMap watches active, with open connections to K8s.
> When the number of connections exceeds the max concurrent requests, new 
> ConfigMap watches cannot be started. Eventually all newly submitted jobs 
> time out.
> [~fly_in_gis] [~trohrmann] This may be related to FLINK-20695, could you 
> confirm this issue?
> But when many jobs are running in the same session cluster, their ConfigMap 
> watches need to stay active. Maybe we should merge all ConfigMap watches?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
