[
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806592#comment-17806592
]
Yang Wang commented on FLINK-33728:
-----------------------------------
The {{KubernetesResourceManagerDriver}} is not the only component that creates
a new watch when it receives {{TooOldResourceVersion}}. The fabric8 K8s client
has similar logic in {{Reflector.java}}[1], which we use for the Flink
Kubernetes HA implementation.
In my opinion, the K8s APIServer should be able to protect itself by using
flow control[2]. It will then reject some requests when it receives more than
it can process, and Flink will retry creating a new watch after the previous
one fails. What Flink could do in addition is use an
{{ExponentialBackoffDelayRetryStrategy}} to replace the current continuous
retry strategy.
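To illustrate the backoff idea, here is a minimal, self-contained sketch. It is not Flink's or fabric8's implementation: the class {{WatchBackoffSketch}}, its method names, and the delay values are hypothetical and chosen only for the example. It doubles the delay after each failed attempt and caps it at a maximum.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Flink or fabric8. */
public class WatchBackoffSketch {

    /** Delay before retry number {@code attempt} (0-based): initial * 2^attempt, capped at max. */
    public static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        // Stop doubling once the cap is reached, which also avoids overflow.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Calls {@code createWatch} until it succeeds or {@code maxAttempts} is exhausted. */
    public static <T> T retryWithBackoff(
            Callable<T> createWatch, int maxAttempts, long initialMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return createWatch.call();
            } catch (Exception e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Assumed values: 1 s initial delay, 30 s cap.
        for (int attempt = 0; attempt < 7; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + delayMillis(attempt, 1_000L, 30_000L) + " ms");
        }
    }
}
```

With the assumed values, the delays grow as 1000, 2000, 4000, 8000, 16000 ms and then stay at the 30000 ms cap, so a client that keeps failing sends far fewer watch requests than one that retries continuously.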
[1]. [https://github.com/fabric8io/kubernetes-client/blob/v6.6.2/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/informers/impl/cache/Reflector.java#L288]
[2]. [https://kubernetes.io/docs/concepts/cluster-administration/flow-control/]
> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
> Issue Type: New Feature
> Components: Deployment / Kubernetes
> Reporter: xiaogang zhou
> Priority: Major
> Labels: pull-request-available
>
> I hit a massive production problem when Kubernetes etcd became slow to respond.
> After Kubernetes recovered an hour later, thousands of Flink jobs using
> kubernetesResourceManagerDriver rewatched on receiving
> ResourceVersionTooOld, which put great pressure on the API server and made
> it fail again...
>
> Is it really necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method?
>
> We could simply ignore the disconnection of the watch and try to rewatch
> once a new requestResource is called. We could also rely on the Akka heartbeat
> timeout to discover TM failures, just as YARN mode does.