[
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804638#comment-17804638
]
xiaogang zhou edited comment on FLINK-33728 at 1/9/24 8:52 AM:
---------------------------------------------------------------
[~xtsong] In the default Flink setting, when the KubernetesClient loses its connection
to the Kube API server, it tries to reconnect indefinitely, because
kubernetes.watch.reconnectLimit is -1. However, the KubernetesClient treats
ResourceVersionTooOld as a special exception that escapes this normal reconnect
loop. Flink's FlinkKubeClient then retries creating the watch up to
kubernetes.transactional-operation.max-retries times, with no interval between the
retries. If the watcher still does not recover, the JM kills itself.
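For illustration only, here is a rough sketch of that retry behaviour (this is not the actual FlinkKubeClient code; the helper name and structure are made up): the watch re-creation is attempted back-to-back with no pause, and the caller gives up once the retries are exhausted.

{code:java}
import java.util.concurrent.Callable;

// Rough sketch of the behaviour described above, NOT the actual
// FlinkKubeClient code: the watch is re-created up to maxRetries times
// back-to-back, with no pause between attempts, and the caller (the JM in
// the scenario above) gives up once the retries are exhausted.
public class ImmediateRetrySketch {

    static <T> T retryWithoutInterval(Callable<T> createWatch, int maxRetries) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                return createWatch.call(); // try to re-create the watch
            } catch (Exception e) {
                last = e; // no sleep / backoff here; retry immediately
            }
        }
        // all attempts failed -> this is the point where the JM would kill itself
        throw new Exception("watch could not be re-created after " + maxRetries + " attempts", last);
    }
}
{code}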
So I think the problem we are trying to solve is not only avoiding massive numbers
of Flink jobs re-creating their watches at the same time, but also allowing Flink to
keep running even when the Kube API server is in a degraded state. Most of the
time, Flink TMs do not need to be bothered by a bad API server.
If you think it is not acceptable to recover the watcher only when resources are
requested, another possible way is to retry re-watching the pods periodically,
roughly along the lines of the sketch below.
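A minimal sketch of that periodic re-watch idea, assuming a hypothetical recreatePodsWatch hook (this is not existing Flink code, just an illustration of the mechanism):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the periodic re-watch idea; recreatePodsWatch is a
// hypothetical hook standing in for whatever re-creates the pods watch.
public class PeriodicRewatchSketch {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean watchHealthy = new AtomicBoolean(true);

    // called from the watch error handler instead of failing the JM
    void onWatchError() {
        watchHealthy.set(false);
    }

    void start(Runnable recreatePodsWatch, long intervalSeconds) {
        // a background task tries to re-create the watch at a fixed interval,
        // so a temporarily bad API server does not bring the whole job down
        scheduler.scheduleWithFixedDelay(
                () -> {
                    if (watchHealthy.get()) {
                        return;
                    }
                    try {
                        recreatePodsWatch.run();
                        watchHealthy.set(true);
                    } catch (Exception e) {
                        // keep the job running and try again on the next tick
                    }
                },
                intervalSeconds,
                intervalSeconds,
                TimeUnit.SECONDS);
    }
}
{code}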
WDYT? :)
> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
> Issue Type: New Feature
> Components: Deployment / Kubernetes
> Reporter: xiaogang zhou
> Priority: Major
> Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes etcd was responding slowly.
> After Kube recovered an hour later, thousands of Flink jobs using
> KubernetesResourceManagerDriver re-watched when receiving
> ResourceVersionTooOld, which put great pressure on the API server and made the
> API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We could simply ignore the disconnection of the watching process and try to
> re-watch once a new requestResource call arrives. We can then rely on the Akka
> heartbeat timeout to discover TM failures, just like the YARN mode does.
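As an illustration of the approach quoted above, a hedged sketch of re-watching lazily on the next resource request, assuming hypothetical onWatchError / requestResource hooks (not existing Flink code):

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the lazy re-watch idea: ignore the watch error and only
// re-create the watch the next time resources are actually requested.
// The hook names here are made up for illustration.
public class LazyRewatchSketch {

    private final AtomicBoolean watchBroken = new AtomicBoolean(false);

    // called from the watch error handler instead of onError(throwable)
    void onWatchError() {
        watchBroken.set(true);
    }

    // called whenever new workers are requested
    void requestResource(Runnable recreatePodsWatch, Runnable requestNewPod) {
        if (watchBroken.compareAndSet(true, false)) {
            try {
                recreatePodsWatch.run(); // re-watch only when the API server is needed again
            } catch (Exception e) {
                watchBroken.set(true);   // still broken; try again on the next request
            }
        }
        requestNewPod.run();
    }
}
{code}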