[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795451#comment-17795451 ]
Matthias Pohl commented on FLINK-33728:
---------------------------------------

Thanks for creating this Jira issue, [~zhoujira86]. AFAIU, you're proposing lazy re-initialization of the watcher after a connection error has left the resourceVersion in an outdated state (i.e. the resourceVersion used by the k8s client no longer matches any pod in the k8s cluster). Re-initialization of the watcher would happen not when the error is detected, but when Flink realizes that the TM is gone and initiates a new TM pod.

Correct me if I'm wrong here, but isn't the watcher watching multiple pods (all TM pods belonging to the Flink cluster), while the {{KubernetesTooOldResourceVersionException}} can be triggered by an error coming from a single pod? If that's the case, not re-initializing the watcher right away would leave us hanging for the other pods' lifecycle events, wouldn't it? We would lose the ability to detect the deletion of other pods. But I guess that's what you mean in your comment above with "delete pod can allow us detect pod failure more quickly, but we can also discover it by detecting the lost of akka heartbeat timeout."?!

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>      Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes ETCD was responding slowly.
> After Kubernetes recovered after 1 hour, thousands of Flink jobs using
> KubernetesResourceManagerDriver re-watched upon receiving
> ResourceVersionTooOld, which caused great pressure on the API server and made
> the API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We can just ignore the disconnection of the watching process and try to re-watch
> once a new requestResource call comes in. We can then leverage the akka heartbeat
> timeout to discover TM failures, just like YARN mode does.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
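The lazy re-watch behavior proposed above can be sketched as follows. This is a minimal, self-contained Java illustration of the idea only; the class and method names ({{LazyPodWatcher}}, {{createWatch}}, etc.) are hypothetical and do not correspond to the actual Flink or fabric8 client implementation, which would call the Kubernetes client where the comments indicate.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: on a "too old resource version" error, mark the
// watcher stale instead of re-creating it immediately; re-create it
// lazily on the next resource request.
class LazyPodWatcher {
    private final AtomicBoolean watcherStale = new AtomicBoolean(false);
    private int watchCreations = 0;

    LazyPodWatcher() {
        createWatch(); // initial watch on the cluster's TM pods
    }

    private void createWatch() {
        watchCreations++;
        watcherStale.set(false);
        // a real implementation would (re-)establish the k8s watch here
    }

    // Invoked from the watch error callback (cf. PodCallbackHandlerImpl#handleError).
    void handleError(boolean tooOldResourceVersion) {
        if (tooOldResourceVersion) {
            // Proposed change: do NOT re-watch here; just remember the watch is gone.
            // This avoids thousands of jobs hammering the API server at once.
            watcherStale.set(true);
        }
        // other fatal errors would still be propagated via onError(throwable)
    }

    // Invoked when Flink requests a new TaskManager pod.
    void requestResource() {
        if (watcherStale.get()) {
            createWatch(); // lazy re-watch, spread out over pod requests
        }
        // ... then actually request the new pod
    }

    int getWatchCreations() { return watchCreations; }
    boolean isStale() { return watcherStale.get(); }
}

public class Main {
    public static void main(String[] args) {
        LazyPodWatcher w = new LazyPodWatcher();
        w.handleError(true);   // ResourceVersionTooOld arrives: no immediate re-watch
        System.out.println(w.getWatchCreations()); // still 1
        w.requestResource();   // next pod request re-establishes the watch
        System.out.println(w.getWatchCreations()); // now 2
    }
}
```

Between the error and the next {{requestResource}} call the driver would be blind to pod deletions, which is exactly the trade-off discussed above: TM loss would instead surface via the akka heartbeat timeout, as in YARN mode.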