[
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804553#comment-17804553
]
xiaogang zhou commented on FLINK-33728:
---------------------------------------
[~mapohl] Your concern is an important one, and my earlier statement was not
precise enough. After your reminder, I'd like to change it to:
We can simply ignore the disconnection of the watch {color:#FF0000}if
there is no pending request{color}, and try to rewatch once a new requestResource
is called.
And we can choose to fail all pending CompletableFutures; the
[requestWorkerIfRequired|https://github.com/apache/flink/blob/2b9b9859253698c3c90ca420f10975e27e6c52d4/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L332]
will then request the resources again, which will trigger the rewatch.
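To make the proposal concrete, here is a minimal sketch in plain Java. It is not the actual KubernetesResourceManagerDriver code: the class LazyRewatchSketch and all of its methods are hypothetical names, and the real Kubernetes watch is replaced by a boolean flag and a counter. It only illustrates the intended control flow, under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the driver; NOT Flink's KubernetesResourceManagerDriver.
class LazyRewatchSketch {
    private final List<CompletableFuture<String>> pending = new ArrayList<>();
    private boolean watching = false;
    private int watchStarts = 0; // counts simulated watch (re)creations

    // Analogue of requestResource: the watch is (re)created only when a request needs it.
    CompletableFuture<String> requestResource() {
        if (!watching) {
            startWatch();
        }
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.add(future);
        return future;
    }

    // Analogue of the watch error callback (e.g. on ResourceVersionTooOld).
    void onWatchError(Throwable cause) {
        watching = false;
        if (pending.isEmpty()) {
            return; // no pending request: ignore the disconnection, do not rewatch
        }
        // Pending requests exist: fail them so the caller can request again,
        // and that new request is what triggers the rewatch.
        List<CompletableFuture<String>> toFail = new ArrayList<>(pending);
        pending.clear();
        for (CompletableFuture<String> future : toFail) {
            future.completeExceptionally(cause);
        }
    }

    // Simulated pod event: completes the oldest pending request, if any.
    void onPodAdded(String podName) {
        if (!pending.isEmpty()) {
            pending.remove(0).complete(podName);
        }
    }

    private void startWatch() {
        watching = true;
        watchStarts++;
    }

    int watchStarts() {
        return watchStarts;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch a watch error with an empty pending list never creates a new watch, so thousands of idle jobs would not hit the API server at the same moment; only jobs that actually need new pods would rewatch.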
WDYT? [~mapohl] [~xtsong]
> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
> Issue Type: New Feature
> Components: Deployment / Kubernetes
> Reporter: xiaogang zhou
> Priority: Major
> Labels: pull-request-available
>
> I hit a massive production problem when Kubernetes etcd became slow to respond.
> After Kubernetes recovered an hour later, thousands of Flink jobs using
> KubernetesResourceManagerDriver rewatched on receiving
> ResourceVersionTooOld, which put great pressure on the API server and made
> the API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We can simply ignore the disconnection of the watch and try to rewatch
> once a new requestResource is called. And we can rely on the Akka heartbeat
> timeout to discover TM failures, just as YARN mode does.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)