[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795451#comment-17795451 ]
Matthias Pohl commented on FLINK-33728:
---------------------------------------

Thanks for creating this Jira issue, [~zhoujira86]. AFAIU, you're proposing lazy re-initialization of the watcher after a connection error has left the resourceVersion in an outdated state (i.e. the resourceVersion used by the k8s client no longer matches any pod in the k8s cluster). Re-initialization of the watcher would happen not when the error is detected, but when Flink realizes that the TM is gone and initiates a new TM pod.

Correct me if I'm wrong here, but isn't the watcher watching multiple pods (all TM pods belonging to the Flink cluster), while the {{KubernetesTooOldResourceVersionException}} can be triggered by an error coming from a single pod? If that's the case, not re-initializing the watcher right away would leave us hanging for the other pods' lifecycle events, wouldn't it? We would lose the ability to detect the deletion of other pods. But I guess that's what you mean in your comment above with "delete pod can allow us detect pod failure more quickly, but we can also discover it by detecting the lost of akka heartbeat timeout."?!

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>      Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes ETCD was responding slowly.
> After Kubernetes recovered after 1 hour, thousands of Flink jobs using
> KubernetesResourceManagerDriver re-watched upon receiving
> ResourceVersionTooOld, which caused great pressure on the API server and made
> the API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We can just ignore the disconnection of the watching process and try to re-watch
> once a new requestResource call comes in. We can then leverage the akka heartbeat
> timeout to discover TM failures, just like YARN mode does.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
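The lazy re-watch behavior proposed above can be sketched as follows. This is a minimal, self-contained Java illustration of the idea only; the class and method names ({{LazyPodWatcher}}, {{createWatch}}, etc.) are hypothetical and do not correspond to the actual Flink or fabric8 client implementation, which would call the Kubernetes client where the comments indicate.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: on a "too old resource version" error, mark the
// watcher stale instead of re-creating it immediately; re-create it
// lazily on the next resource request.
class LazyPodWatcher {
    private final AtomicBoolean watcherStale = new AtomicBoolean(false);
    private int watchCreations = 0;

    LazyPodWatcher() {
        createWatch(); // initial watch on the cluster's TM pods
    }

    private void createWatch() {
        watchCreations++;
        watcherStale.set(false);
        // a real implementation would (re-)establish the k8s watch here
    }

    // Invoked from the watch error callback (cf. PodCallbackHandlerImpl#handleError).
    void handleError(boolean tooOldResourceVersion) {
        if (tooOldResourceVersion) {
            // Proposed change: do NOT re-watch here; just remember the watch is gone.
            // This avoids thousands of jobs hammering the API server at once.
            watcherStale.set(true);
        }
        // other fatal errors would still be propagated via onError(throwable)
    }

    // Invoked when Flink requests a new TaskManager pod.
    void requestResource() {
        if (watcherStale.get()) {
            createWatch(); // lazy re-watch, spread out over pod requests
        }
        // ... then actually request the new pod
    }

    int getWatchCreations() { return watchCreations; }
    boolean isStale() { return watcherStale.get(); }
}

public class Main {
    public static void main(String[] args) {
        LazyPodWatcher w = new LazyPodWatcher();
        w.handleError(true);   // ResourceVersionTooOld arrives: no immediate re-watch
        System.out.println(w.getWatchCreations()); // still 1
        w.requestResource();   // next pod request re-establishes the watch
        System.out.println(w.getWatchCreations()); // now 2
    }
}
```

Between the error and the next {{requestResource}} call the driver would be blind to pod deletions, which is exactly the trade-off discussed above: TM loss would instead surface via the akka heartbeat timeout, as in YARN mode.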