[
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795571#comment-17795571
]
xiaogang zhou commented on FLINK-33728:
---------------------------------------
Hi [~mapohl], thanks for the comment above. Sorry for my poor written English
:P, but I think your re-clarification is exactly what I am proposing. I'd like
to introduce a lazy re-initialization of the watch mechanism, which tolerates a
disconnection of the watch until a new pod is requested.
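To make the idea concrete, here is a minimal sketch of the lazy re-watch I have
in mind. This is illustrative only, not the actual Flink code: the names
{{watchStale}}, {{onWatchError}}, {{requestResource}} and
{{recreatePodsWatch}} are all assumptions for the sketch.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

public class LazyRewatchSketch {

    // Set when the watch connection is lost; checked on the next pod request.
    private final AtomicBoolean watchStale = new AtomicBoolean(false);

    // Called from the watch error callback: instead of failing over or
    // re-watching immediately, just remember that the watch is broken.
    void onWatchError(Throwable t) {
        watchStale.set(true);
    }

    // Called whenever the resource manager asks for a new pod: if the watch
    // went stale in the meantime, re-create it lazily here.
    void requestResource() {
        if (watchStale.compareAndSet(true, false)) {
            recreatePodsWatch();
        }
        // ... continue with the normal pod creation path ...
    }

    private void recreatePodsWatch() {
        // re-establish the pods watch against the API server
    }
}
{code}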
And I think your concern is how we detect a TM loss without an active watcher.
I have tested my change in a real Kubernetes environment: with a disconnected
watcher, I killed a TM pod, and after no more than 50 s the task restarted with
the following exception:
{code:java}
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id flink-6168d34cf9d3a5d31ad8bb02bce6a370-taskmanager-1-8 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1306)
    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitC
{code}
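The ~50 s matches Flink's default {{heartbeat.timeout}} of 50000 ms, so the TM
loss is still detected even with the watcher down. A small sketch using
Flink's standard Configuration API (the class name and the 30 s value are just
examples) shows how the timeout could be tightened for faster detection:
{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HeartbeatManagerOptions;

public class HeartbeatTimeoutExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default heartbeat.timeout is 50000 ms, which explains the ~50 s
        // observed above; lower it for faster TM-loss detection if desired.
        conf.setLong(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 30_000L);
        long timeout = conf.getLong(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT);
        System.out.println("heartbeat timeout = " + timeout + " ms");
    }
}
{code}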
Moreover, I think YARN does not have a watcher mechanism either, so Flink
scheduled on YARN also relies on the heartbeat timeout mechanism, right?
And an active re-watching strategy can put great pressure on the API server,
especially on early Kubernetes versions where resourceVersion=0 is not set in
the watch-list request.
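For reference, a minimal fabric8-client sketch (fabric8 5.x-style API; the
class name, namespace and label selector are placeholders) of setting
resourceVersion "0" on the watch, so the API server can answer from its cache
instead of doing a quorum read against etcd:
{code:java}
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class WatchFromCacheSketch {
    public static void main(String[] args) {
        KubernetesClient client = new DefaultKubernetesClient();
        Watch watch = client.pods()
                .inNamespace("default")               // placeholder namespace
                .withLabel("app", "my-flink-cluster") // placeholder label
                // resourceVersion "0" lets the API server serve the watch
                // from its cache rather than hitting etcd directly
                .watch(new ListOptionsBuilder().withResourceVersion("0").build(),
                        new Watcher<Pod>() {
                            @Override
                            public void eventReceived(Action action, Pod pod) {
                                // handle pod added/modified/deleted events
                            }

                            @Override
                            public void onClose(WatcherException cause) {
                                // tolerate the disconnect; re-watch lazily later
                            }
                        });
    }
}
{code}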
> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
> Issue Type: New Feature
> Components: Deployment / Kubernetes
> Reporter: xiaogang zhou
> Priority: Major
> Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes etcd responded slowly.
> After Kubernetes recovered an hour later, thousands of Flink jobs using
> KubernetesResourceManagerDriver re-watched upon receiving
> ResourceVersionTooOld, which put great pressure on the API server and made
> the API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We can just tolerate the disconnection of the watching process and try to
> re-watch once a new requestResource is called. And we can rely on the Akka
> heartbeat timeout to discover TM failures, just like YARN mode does.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)