[https://issues.apache.org/jira/browse/FLINK-22006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310619#comment-17310619]
Yang Wang commented on FLINK-22006:
-----------------------------------
Yes, I could reproduce this issue. I believe the root cause is that the fabric8
Kubernetes client configures the {{MaxRequests}} of the
{{OkHttpClient#Dispatcher}}[1] to 64[2], which means we cannot create more
than 64 watchers in the JobManager pod.
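To make the failure mode concrete, here is a rough stdlib-only illustration (not fabric8 or Flink code; only the 64 default comes from the links below): each active watch occupies one dispatcher request slot for its whole lifetime, so once all slots are held, the next watch request queues and never starts.

```java
import java.util.concurrent.Semaphore;

// Illustration only: models the dispatcher's fixed pool of request slots.
// A long-running watch holds its slot until it is closed, so the pool
// eventually fills up as more jobs (and their ConfigMap watches) start.
public class DispatcherCapDemo {
    public static void main(String[] args) {
        final int maxRequests = 64; // fabric8's hard-coded default
        Semaphore slots = new Semaphore(maxRequests);

        // 64 long-running watches each grab a slot and keep it.
        for (int i = 0; i < maxRequests; i++) {
            boolean started = slots.tryAcquire();
            System.out.println("watch " + (i + 1) + " started: " + started);
        }

        // The 65th watch cannot start; it just sits in the queue.
        System.out.println("watch 65 started: " + slots.tryAcquire()); // false
    }
}
```

Since each HA-enabled job contributes several watches on top of the session cluster's baseline, this cap is plausibly what surfaces as the ~20-job limit reported in the issue.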
Normally, this could be configured in Flink via
{{-Denv.java.opts="-Dkubernetes.max.concurrent.requests=1000"}}. Unfortunately,
fabric8 Kubernetes client 4.9.2 has a bug[3] that prevents the max concurrent
requests from being set via system properties.
The fabric8 Kubernetes client 4.10 introduces too many changes, so I do not
suggest bumping the dependency in Flink now. Instead, I think we could set the
max concurrent requests explicitly in
{{DefaultKubeClientFactory#fromConfiguration}} when building the Kubernetes
client.
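As a sketch of that workaround (the class and method names here are illustrative, not Flink's actual code; the OkHttp and fabric8 calls are the standard 4.x APIs), the factory could rebuild the HTTP client with a larger dispatcher before handing it to fabric8, sidestepping the broken system-property path:

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.utils.HttpClientUtils;
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

// Hypothetical sketch, assuming fabric8 4.9.x and OkHttp 3.
public final class KubeClientFactorySketch {
    public static DefaultKubernetesClient create(Config config, int maxConcurrentRequests) {
        // Build the HTTP client the same way fabric8 does (TLS, proxy, timeouts, ...).
        OkHttpClient base = HttpClientUtils.createHttpClient(config);
        // Each active watch holds one in-flight request against the API server,
        // so both the global and the per-host limit need to be raised.
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequests(maxConcurrentRequests);
        dispatcher.setMaxRequestsPerHost(maxConcurrentRequests);
        OkHttpClient httpClient = base.newBuilder().dispatcher(dispatcher).build();
        return new DefaultKubernetesClient(httpClient, config);
    }
}
```

This keeps all of fabric8's own TLS and auth setup intact and only overrides the dispatcher limits.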
Your suggestion of using a single ConfigMap watch is somewhat reasonable.
However, in the current HA design, {{DefaultLeaderRetrievalService}} is not
aware of all the running jobs, so it is not easy to dispatch watcher events to
the corresponding listeners. That would be a big change and needs more
discussion.
# [https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/utils/HttpClientUtils.java#L166]
# [https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Config.java#L135]
# [https://github.com/fabric8io/kubernetes-client/issues/2531]
> Could not run more than 20 jobs in a native K8s session when K8s HA enabled
> ---------------------------------------------------------------------------
>
> Key: FLINK-22006
> URL: https://issues.apache.org/jira/browse/FLINK-22006
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Yang Wang
> Priority: Critical
> Labels: k8s-ha
> Attachments: image-2021-03-24-18-08-42-116.png
>
>
> Currently, if we start a native K8s session cluster with K8s HA enabled, we
> cannot run more than 20 streaming jobs.
>
> The latest job is always initializing, and the previous one is created and
> waiting to be assigned. It seems that some internal resources have been
> exhausted, e.g. the okhttp thread pool, TCP connections, or something else.
> !image-2021-03-24-18-08-42-116.png!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)