[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795571#comment-17795571 ]
xiaogang zhou commented on FLINK-33728:
---------------------------------------

Hi [~mapohl], thanks for the comment above. Sorry for my poor written English :P, but I think your re-clarification is exactly what I am proposing. I'd like to introduce a lazy re-initialization of the watch mechanism, which tolerates a disconnected watch until a new pod is requested. And I think your concern is how we detect a TM loss without an active watcher.

I have tested my change in a real K8s environment. With a disconnected watcher, I killed a TM pod; after no more than 50s, the task restarted with this exception:

{code:java}
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id flink-6168d34cf9d3a5d31ad8bb02bce6a370-taskmanager-1-8 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1306)
    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitC
{code}

Moreover, I think YARN also does not have a watcher mechanism, so Flink scheduled on YARN likewise relies on the heartbeat timeout mechanism. And an active re-watch strategy can put great pressure on the API server, especially in early versions that did not set resource version zero in the watch-list request.
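To make the proposal concrete, here is a minimal, self-contained sketch of the lazy re-watch idea. All names here ({{LazyPodWatcher}}, {{handleWatchError}}, {{requestResource}}, {{createWatch}}) are hypothetical and are not the actual Flink or fabric8 API; the real change would live in the KubernetesResourceManagerDriver callback path.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Sketch of the proposed "lazy re-watch" strategy: a watch failure only
 * marks the watch as inactive, and the watch is re-created the next time
 * a new pod is requested. TM loss in the meantime is still detected by
 * the heartbeat timeout, as in YARN mode.
 */
public class LazyPodWatcher {
    private final AtomicBoolean watchActive = new AtomicBoolean(false);

    /** Called when the watch connection fails (e.g. ResourceVersionTooOld). */
    public void handleWatchError(Throwable t) {
        // Do NOT re-watch immediately and do NOT call onError():
        // just remember that the watch is gone.
        watchActive.set(false);
    }

    /** Called when the resource manager requests a new pod. */
    public void requestResource() {
        // Re-establish the watch lazily, only when it is actually needed.
        if (watchActive.compareAndSet(false, true)) {
            createWatch();
        }
        // ... then create the pod as usual ...
    }

    private void createWatch() {
        // In the real driver this would open a new pod watch
        // against the API server.
    }

    public boolean isWatchActive() {
        return watchActive.get();
    }
}
{code}

With thousands of jobs, this avoids the re-watch stampede after an API-server recovery: disconnected jobs stay idle until they actually need a pod.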
> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes etcd responded slowly.
> After Kubernetes recovered an hour later, thousands of Flink jobs using
> KubernetesResourceManagerDriver re-watched when receiving
> ResourceVersionTooOld, which caused great pressure on the API server and made
> the API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We could simply ignore the disconnection of the watching process and try to
> re-watch once a new requestResource is called. And we can rely on the akka
> heartbeat timeout to discover TM failures, just as YARN mode does.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)