[ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804564#comment-17804564
 ] 

Xintong Song commented on FLINK-33728:
--------------------------------------

Thanks for pulling me in.

I'm also concerned about solely relying on the heartbeat timeout to detect pod 
failure. In addition to the cleaning-up issue, it can also delay the detection of 
pod failures in many cases.

IIUC, the problem we are trying to solve here is to avoid massive numbers of Flink 
jobs trying to re-create watches at the same time. That doesn't necessarily lead 
to the proposed solution.
1. I think this is not a problem of individual Flink jobs, but a problem of the 
K8s cluster that runs massive Flink workloads. Ideally, such problems, i.e. how 
to better deal with the massive workloads, should be solved on the K8s cluster 
side. However, I don't have the expertise to come up with a cluster-side 
solution.
2. If 1) is not feasible, I think we can introduce a random backoff. The user may 
configure a max backoff time (default 0), and Flink randomly picks a delay that 
is no greater than the max before re-creating the watch (see the sketch below). 
Ideally, that would spread the pressure on the API server over a longer and 
configurable period.
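A minimal sketch of that idea, assuming a hypothetical max-backoff setting and a 
recreateWatch() callback supplied by the caller (neither is an existing Flink 
config option or API, they are just illustrative names):

    import java.time.Duration;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    /**
     * Illustrative sketch only: delays watch re-creation by a random amount of
     * time bounded by a user-configured maximum, so thousands of jobs do not
     * hit the API server at the same instant after a shared failure.
     */
    public class RandomizedWatchRecreator {

        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Maximum backoff, e.g. read from a (hypothetical) config option; default 0
        // keeps today's behavior of re-creating the watch immediately.
        private final Duration maxBackoff;

        public RandomizedWatchRecreator(Duration maxBackoff) {
            this.maxBackoff = maxBackoff;
        }

        /** Schedules recreateWatch after a random delay in [0, maxBackoff]. */
        public void scheduleRecreation(Runnable recreateWatch) {
            long maxMillis = maxBackoff.toMillis();
            long delayMillis =
                    maxMillis == 0 ? 0 : ThreadLocalRandom.current().nextLong(maxMillis + 1);
            scheduler.schedule(recreateWatch, delayMillis, TimeUnit.MILLISECONDS);
        }
    }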

WDYT?

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I met a massive production problem when Kubernetes ETCD started responding slowly. 
> After Kubernetes recovered an hour later, thousands of Flink jobs using 
> KubernetesResourceManagerDriver re-created their watches upon receiving 
> ResourceVersionTooOld, which put great pressure on the API server and made the 
> API server fail again... 
>  
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>  
> We could simply ignore the disconnection of the watching process and try to 
> rewatch once a new requestResource call arrives. We can then rely on the Akka 
> heartbeat timeout to discover TM failures, just like YARN mode does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
