[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795571#comment-17795571 ]
xiaogang zhou commented on FLINK-33728:
---------------------------------------

Hi [~mapohl], thanks for the comment above. Sorry for my poor written English :P, but I think your re-clarification is exactly what I am proposing. I'd like to introduce a lazy re-initialization of the watch mechanism, which tolerates a disconnected watch until a new pod is requested. And I think your concern is how we detect a TM loss without an active watcher.

I have tested my change in a real K8s environment. With a disconnected watcher, I killed a TM pod; after no more than 50s, the task restarted with this exception:

{code:java}
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id flink-6168d34cf9d3a5d31ad8bb02bce6a370-taskmanager-1-8 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1306)
    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitC
{code}

Moreover, I think YARN also does not have a watcher mechanism, so Flink scheduled on YARN likewise relies on the heartbeat timeout mechanism. And an active re-watch strategy can put great pressure on the API server, especially in early versions that did not set resource version zero in the watch-list request.
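To make the proposal concrete, here is a minimal, self-contained sketch of the lazy re-watch idea. All names here ({{LazyPodWatcher}}, {{handleWatchError}}, {{requestResource}}, {{createWatch}}) are hypothetical and are not the actual Flink or fabric8 API; the real change would live in the KubernetesResourceManagerDriver callback path.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Sketch of the proposed "lazy re-watch" strategy: a watch failure only
 * marks the watch as inactive, and the watch is re-created the next time
 * a new pod is requested. TM loss in the meantime is still detected by
 * the heartbeat timeout, as in YARN mode.
 */
public class LazyPodWatcher {
    private final AtomicBoolean watchActive = new AtomicBoolean(false);

    /** Called when the watch connection fails (e.g. ResourceVersionTooOld). */
    public void handleWatchError(Throwable t) {
        // Do NOT re-watch immediately and do NOT call onError():
        // just remember that the watch is gone.
        watchActive.set(false);
    }

    /** Called when the resource manager requests a new pod. */
    public void requestResource() {
        // Re-establish the watch lazily, only when it is actually needed.
        if (watchActive.compareAndSet(false, true)) {
            createWatch();
        }
        // ... then create the pod as usual ...
    }

    private void createWatch() {
        // In the real driver this would open a new pod watch
        // against the API server.
    }

    public boolean isWatchActive() {
        return watchActive.get();
    }
}
{code}

With thousands of jobs, this avoids the re-watch stampede after an API-server recovery: disconnected jobs stay idle until they actually need a pod.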
> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes etcd responded slowly.
> After Kubernetes recovered an hour later, thousands of Flink jobs using
> KubernetesResourceManagerDriver re-watched when receiving
> ResourceVersionTooOld, which caused great pressure on the API server and made
> the API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We could simply ignore the disconnection of the watching process and try to
> re-watch once a new requestResource is called. And we can rely on the akka
> heartbeat timeout to discover TM failures, just as YARN mode does.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)