[
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xiaogang zhou updated FLINK-33728:
----------------------------------
Description:
I hit a massive production problem when Kubernetes etcd became slow to respond.
When Kubernetes recovered after 1 hour, thousands of Flink jobs using
KubernetesResourceManagerDriver rewatched on receiving ResourceVersionTooOld,
which put great pressure on the API server and made it fail again...
I am not sure whether it is necessary to call
getResourceEventHandler().onError(throwable)
in the PodCallbackHandlerImpl#handleError method.
We could simply ignore the disconnection of the watch and try to rewatch
the next time requestResource is called. We could then rely on the Akka
heartbeat timeout to discover TM failures, just as YARN mode does.
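The proposal above can be sketched roughly as follows. This is only a minimal
illustration of the "rewatch lazily" idea, not Flink's actual code:
FakeKubeClient, LazyRewatchDriver, handleWatchError and openPodWatch are
hypothetical names made up for this sketch, and pod creation and TM failure
detection are left out.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a Kubernetes client; it only counts opened watches. */
class FakeKubeClient {
    final AtomicInteger watchesOpened = new AtomicInteger();

    void openPodWatch() {
        // A real client would issue a watch request to the API server here.
        watchesOpened.incrementAndGet();
    }
}

/** Hypothetical driver sketch; this is NOT KubernetesResourceManagerDriver. */
class LazyRewatchDriver {
    private final FakeKubeClient client;
    private boolean watchActive;

    LazyRewatchDriver(FakeKubeClient client) {
        this.client = client;
        client.openPodWatch(); // initial watch at startup
        this.watchActive = true;
    }

    /** Called when the watch breaks, e.g. on ResourceVersionTooOld. */
    synchronized void handleWatchError(Throwable cause) {
        // Only record the disconnection; no request is sent to the API server,
        // so thousands of jobs do not rewatch at the same moment.
        watchActive = false;
    }

    /** Called when a new worker is requested. */
    synchronized void requestResource() {
        if (!watchActive) {
            // Re-create the watch lazily, only when it is actually needed.
            client.openPodWatch();
            watchActive = true;
        }
        // Pod creation would follow here (omitted in this sketch).
    }

    synchronized boolean isWatchActive() {
        return watchActive;
    }
}
```

In this sketch, repeated watch errors cost nothing on the API server side, and
the watch is opened again only on the next requestResource call. While the
watch is down, pod failures are not observed through it, which is why the
heartbeat timeout would be needed to detect a lost TM.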
was:
Is it necessary to call
getResourceEventHandler().onError(throwable)
in the PodCallbackHandlerImpl#handleError method?
We could simply ignore the disconnection of the watch and try to rewatch
the next time requestResource is called.
> Do not rewatch when the KubernetesResourceManagerDriver watch fails
> -------------------------------------------------------------------
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
> Issue Type: New Feature
> Components: Deployment / Kubernetes
> Reporter: xiaogang zhou
> Priority: Major
>
> I hit a massive production problem when Kubernetes etcd became slow to respond.
> When Kubernetes recovered after 1 hour, thousands of Flink jobs using
> KubernetesResourceManagerDriver rewatched on receiving
> ResourceVersionTooOld, which put great pressure on the API server and made
> it fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We could simply ignore the disconnection of the watch and try to rewatch
> the next time requestResource is called. We could then rely on the Akka
> heartbeat timeout to discover TM failures, just as YARN mode does.