[ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794900#comment-17794900
 ] 

xiaogang zhou commented on FLINK-33728:
---------------------------------------

[~gyfora] my proposal is to keep the JobManager running even if the rewatch fails.

A healthy watch listener receives two kinds of notifications from Kubernetes:
add pod and delete pod.

1. The add pod notification is needed only while we are requesting resources.
When we are not requesting resources, it is acceptable to lose this
notification.

2. The delete pod notification lets us detect pod failures more quickly, but
we can also discover them through the akka heartbeat timeout.

 

Based on the above, we can tolerate losing the watch connection while we are
not requesting resources.
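
To make the proposal concrete, here is a rough sketch of the policy. It is not
the actual KubernetesResourceManagerDriver code: the class LazyRewatchPolicy and
all of its methods are hypothetical names used only for illustration, and the
real rewatch call is replaced by a counter.

```java
/**
 * Hypothetical sketch of the proposed policy; not Flink code.
 * A lost watch is tolerated while no pod request is outstanding and is
 * re-established lazily on the next resource request.
 */
final class LazyRewatchPolicy {
    // Pods requested whose "add pod" notification has not arrived yet.
    private int pendingPodRequests;
    private boolean watchAlive = true;
    // Stand-in for real rewatch calls against the API server.
    private int rewatchCount;

    /** Called when the watch connection fails. Returns true if we rewatch now. */
    synchronized boolean onWatchError() {
        watchAlive = false;
        if (pendingPodRequests > 0) {
            // We still need "add pod" notifications, so the watch is required.
            rewatch();
            return true;
        }
        // Idle: tolerate the loss. Pod failures are assumed to be caught
        // by the heartbeat timeout instead of "delete pod" notifications.
        return false;
    }

    /** Called when new resources are requested; rewatches first if needed. */
    synchronized void requestResource() {
        pendingPodRequests++;
        if (!watchAlive) {
            rewatch();
        }
    }

    /** Called when an "add pod" notification arrives. */
    synchronized void onPodAdded() {
        if (pendingPodRequests > 0) {
            pendingPodRequests--;
        }
    }

    synchronized boolean isWatchAlive() {
        return watchAlive;
    }

    synchronized int getRewatchCount() {
        return rewatchCount;
    }

    private void rewatch() {
        watchAlive = true;
        rewatchCount++;
    }
}

final class LazyRewatchPolicyDemo {
    public static void main(String[] args) {
        LazyRewatchPolicy policy = new LazyRewatchPolicy();
        System.out.println("rewatch while idle: " + policy.onWatchError());
        policy.requestResource();
        System.out.println("watch alive after request: " + policy.isWatchAlive());
    }
}
```

With this policy an idle JobManager does not call the API server when its watch
breaks. If thousands of jobs lose their watch at once, only those with
outstanding pod requests rewatch immediately.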

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>
> I hit a massive production problem when Kubernetes etcd became slow to
> respond. After Kubernetes recovered an hour later, thousands of Flink jobs
> using KubernetesResourceManagerDriver rewatched on receiving
> ResourceVersionTooOld, which put great pressure on the API server and made
> it fail again...
>  
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>  
> We could simply ignore the disconnection of the watch and try to rewatch
> once a new requestResource is called. We can rely on the akka heartbeat
> timeout to discover TM failures, just as YARN mode does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
