[ https://issues.apache.org/jira/browse/FLINK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307734#comment-17307734 ]

Yi Tang edited comment on FLINK-21942 at 3/24/21, 10:20 AM:
------------------------------------------------------------

A Kubernetes session cluster was started with the following options:
{code}
cluster.io-pool.size=16
kubernetes.jobmanager.cpu=2
{code}
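
For reference, a session cluster with these options can be started roughly like this (the cluster id and namespace below are placeholders, not taken from the report):
{code}
# Sketch only: start the session cluster with the options above.
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=<cluster-id> \
  -Dkubernetes.namespace=<namespace> \
  -Dcluster.io-pool.size=16 \
  -Dkubernetes.jobmanager.cpu=2
{code}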
h2. The simplest scenario

Run batch jobs one by one, e.g. examples/batch/WordCount.jar.
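
A submission along these lines (assuming a Flink 1.12+ CLI and the placeholder cluster id above) should do it:
{code}
# Sketch only: submit the batch example to the existing session cluster.
./bin/flink run \
  -t kubernetes-session \
  -Dkubernetes.cluster-id=<cluster-id> \
  -Dkubernetes.namespace=<namespace> \
  ./examples/batch/WordCount.jar
{code}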

!image-2021-03-24-18-08-30-196.png|width=990,height=629!

In this scenario, the latest job is created and waiting to be assigned, while the 
previous one failed after waiting 5 minutes. Each FINISHED job still occupies one 
config map watch, so almost 60 jobs can be run successfully.


h2. Another scenario

With the resource limits set as follows:
{code}
kubernetes.taskmanager.cpu=2
taskmanager.numberOfTaskSlots: 10
{code}

Run streaming jobs, e.g. examples/streaming/StateMachineExample.jar --error-rate 0.5 
--sleep 100.
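
The streaming jobs can be submitted the same way, detached (again just a sketch with the same placeholders):
{code}
# Sketch only: submit the streaming example detached to the same session cluster.
./bin/flink run -d \
  -t kubernetes-session \
  -Dkubernetes.cluster-id=<cluster-id> \
  -Dkubernetes.namespace=<namespace> \
  ./examples/streaming/StateMachineExample.jar --error-rate 0.5 --sleep 100
{code}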

!image-2021-03-24-18-08-42-116.png|width=990,height=629!

In this scenario, the latest job is always initializing, and the previous one 
is created and waiting to be assigned. Each FINISHED job still occupies only one 
config map watch, so almost 60 jobs can be run successfully. Each RUNNING job 
occupies three config map watches, so almost 20 jobs can be running at the same time.



> KubernetesLeaderRetrievalDriver not closed after terminated which lead to 
> connection leak
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21942
>                 URL: https://issues.apache.org/jira/browse/FLINK-21942
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yi Tang
>            Priority: Major
>         Attachments: image-2021-03-24-18-08-30-196.png, 
> image-2021-03-24-18-08-42-116.png
>
>
> Looks like the KubernetesLeaderRetrievalDriver is not closed even if the 
> KubernetesLeaderElectionDriver is closed and the job has reached a globally terminal state.
> This leaves many configmap watches still active, each holding a connection to 
> K8s.
> When the number of connections exceeds the max concurrent requests, new configmap 
> watches cannot be started, which finally leads to all newly submitted jobs timing out.
> [~fly_in_gis] [~trohrmann] This may be related to FLINK-20695, could you 
> confirm this issue?
> But when many jobs are running in the same session cluster, the config map 
> watches are required to stay active. Maybe we should merge all the config map 
> watches?


