[https://issues.apache.org/jira/browse/FLINK-22006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310619#comment-17310619]
Yang Wang commented on FLINK-22006:
-----------------------------------
Yes, I could reproduce this issue. I believe the root cause is that the fabric8
Kubernetes client configures the {{MaxRequests}} of the
{{OkHttpClient#Dispatcher}}[1] to 64[2], which means we cannot create more
than 64 watchers in the JobManager pod.
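To make the failure mode concrete, here is a rough stdlib-only illustration (not fabric8 or Flink code; only the 64 default comes from the links below): each active watch occupies one dispatcher request slot for its whole lifetime, so once all slots are held, the next watch request queues and never starts.

```java
import java.util.concurrent.Semaphore;

// Illustration only: models the dispatcher's fixed pool of request slots.
// A long-running watch holds its slot until it is closed, so the pool
// eventually fills up as more jobs (and their ConfigMap watches) start.
public class DispatcherCapDemo {
    public static void main(String[] args) {
        final int maxRequests = 64; // fabric8's hard-coded default
        Semaphore slots = new Semaphore(maxRequests);

        // 64 long-running watches each grab a slot and keep it.
        for (int i = 0; i < maxRequests; i++) {
            boolean started = slots.tryAcquire();
            System.out.println("watch " + (i + 1) + " started: " + started);
        }

        // The 65th watch cannot start; it just sits in the queue.
        System.out.println("watch 65 started: " + slots.tryAcquire()); // false
    }
}
```

Since each HA-enabled job contributes several watches on top of the session cluster's baseline, this cap is plausibly what surfaces as the ~20-job limit reported in the issue.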
Normally, this could be configured in Flink via
{{-Denv.java.opts="-Dkubernetes.max.concurrent.requests=1000"}}. Unfortunately,
fabric8 Kubernetes client 4.9.2 has a bug[3] that prevents the max concurrent
requests from being set via system properties.
The fabric8 Kubernetes client 4.10 introduces too many changes, so I do not
suggest bumping the dependency in Flink now. Instead, I think we could set the
max concurrent requests explicitly in
{{DefaultKubeClientFactory#fromConfiguration}} when building the Kubernetes
client.
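As a sketch of that workaround (the class and method names here are illustrative, not Flink's actual code; the OkHttp and fabric8 calls are the standard 4.x APIs), the factory could rebuild the HTTP client with a larger dispatcher before handing it to fabric8, sidestepping the broken system-property path:

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.utils.HttpClientUtils;
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

// Hypothetical sketch, assuming fabric8 4.9.x and OkHttp 3.
public final class KubeClientFactorySketch {
    public static DefaultKubernetesClient create(Config config, int maxConcurrentRequests) {
        // Build the HTTP client the same way fabric8 does (TLS, proxy, timeouts, ...).
        OkHttpClient base = HttpClientUtils.createHttpClient(config);
        // Each active watch holds one in-flight request against the API server,
        // so both the global and the per-host limit need to be raised.
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequests(maxConcurrentRequests);
        dispatcher.setMaxRequestsPerHost(maxConcurrentRequests);
        OkHttpClient httpClient = base.newBuilder().dispatcher(dispatcher).build();
        return new DefaultKubernetesClient(httpClient, config);
    }
}
```

This keeps all of fabric8's own TLS and auth setup intact and only overrides the dispatcher limits.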
Your suggestion of using a single ConfigMap watch is somewhat reasonable.
However, in the current HA design, {{DefaultLeaderRetrievalService}} is not
aware of all the running jobs, so it is not easy to dispatch watcher events to
the corresponding listeners. That would be a big change and needs more
discussion.
# [https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/utils/HttpClientUtils.java#L166]
# [https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Config.java#L135]
# [https://github.com/fabric8io/kubernetes-client/issues/2531]
> Could not run more than 20 jobs in a native K8s session when K8s HA enabled
> ---------------------------------------------------------------------------
>
> Key: FLINK-22006
> URL: https://issues.apache.org/jira/browse/FLINK-22006
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Yang Wang
> Priority: Critical
> Labels: k8s-ha
> Attachments: image-2021-03-24-18-08-42-116.png
>
>
> Currently, if we start a native K8s session cluster with K8s HA enabled, we
> cannot run more than 20 streaming jobs.
>
> The latest job is always initializing, and the previous one is created and
> waiting to be assigned. It seems that some internal resources have been
> exhausted, e.g. the okhttp thread pool, TCP connections, or something else.
> !image-2021-03-24-18-08-42-116.png!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)