[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804361#comment-17804361 ]

Matthias Pohl commented on FLINK-33728:
---------------------------------------

Hi [~zhoujira86], thanks for your patience. I managed to look into the issue 
once more. I'm not sure whether your proposal would work. As mentioned in 
previous comments, we would lose track of pods that are terminated. That might 
be covered by the TaskManager's heartbeats in some way, but I am concerned 
about the pod termination logic: the termination event for a pod triggers 
[onPodTerminated|https://github.com/apache/flink/blob/b865151d23ef92879941a63f40c9fac7c6b9b98c/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L416],
 which takes care of cleaning up the corresponding resource request (see 
[KubernetesResourceManagerDriver:424ff|https://github.com/apache/flink/blob/b865151d23ef92879941a63f40c9fac7c6b9b98c/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L424]).
 To me it looks like there could be a scenario where we do not complete open 
requests properly.

That seems to be an edge case: if I understand the code correctly, we would 
have to lose the watcher after a request was initiated but before it is 
fulfilled. Still, that sounds like a possible memory issue due to missing 
cleanup (see the sketch below). [~xtsong], can you help with this issue?
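
To illustrate the concern, here is a minimal, purely hypothetical sketch of 
that pattern (class, field and method names are made up and are not the actual 
KubernetesResourceManagerDriver API): pending resource requests are only 
completed or failed by pod watch events, so swallowing watch errors can leave 
entries behind indefinitely.

{code:java}
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not the actual Flink code) of the cleanup pattern
 * discussed above: pending resource requests are tracked per pod and are only
 * completed/failed by pod watch events. If watch errors are ignored and
 * termination events stop arriving, entries can stay in the map forever.
 */
public class PendingRequestSketch {

    /** Pending "give me a TaskManager pod" requests, keyed by pod name. */
    private final Map<String, CompletableFuture<Void>> requestResourceFutures =
            new ConcurrentHashMap<>();

    /** Called when a new pod is requested; the future is completed later by watch events. */
    CompletableFuture<Void> requestResource(String podName) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        requestResourceFutures.put(podName, future);
        // ... create the pod via the Kubernetes client ...
        return future;
    }

    /** Watch callback: pod became ready. */
    void onPodAdded(String podName) {
        CompletableFuture<Void> future = requestResourceFutures.remove(podName);
        if (future != null) {
            future.complete(null);
        }
    }

    /** Watch callback: pod terminated, possibly before the request was fulfilled. */
    void onPodTerminated(String podName) {
        CompletableFuture<Void> future = requestResourceFutures.remove(podName);
        if (future != null) {
            // Without this path the pending request is never cleaned up.
            future.completeExceptionally(new IllegalStateException(
                    "Pod " + podName + " terminated before it was ready"));
        }
    }

    /**
     * Watch error callback. If we simply ignore the error instead of
     * re-establishing the watch, onPodAdded/onPodTerminated may never be
     * delivered for pods created in the meantime, and their futures stay in
     * requestResourceFutures indefinitely.
     */
    void handleWatchError(Throwable t) {
        // proposal under discussion: do nothing here and re-watch lazily
    }
}
{code}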

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I ran into a massive production problem when the Kubernetes etcd responded 
> slowly. After Kubernetes recovered an hour later, thousands of Flink jobs 
> using the KubernetesResourceManagerDriver re-watched upon receiving 
> ResourceVersionTooOld, which put great pressure on the API server and made 
> the API server fail again...
>  
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>  
> We could simply ignore the disconnection of the watching process and try to 
> re-watch once a new requestResource is called. We could rely on the Akka 
> heartbeat timeout to discover TM failures, just like the YARN mode does.
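
For reference, a rough sketch of what the lazy re-watch proposed in the quoted 
description could look like (hypothetical names, not the actual driver code): 
the watch error is only recorded, and the watch is re-established on the next 
requestResource call, while TM failures during the gap would be left to the 
heartbeat timeout, as in the YARN deployment.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch of the proposal in this ticket (names are made up; this
 * is not the actual KubernetesResourceManagerDriver): instead of re-watching
 * immediately in the watch error callback, remember that the watch is broken
 * and only re-establish it on the next resource request, so that thousands of
 * jobs do not hit the API server at the same time.
 */
public class LazyRewatchSketch {

    private final AtomicBoolean watchBroken = new AtomicBoolean(false);

    /** Watch error callback, e.g. on ResourceVersionTooOld: just record the failure. */
    void handleWatchError(Throwable t) {
        watchBroken.set(true);
        // no immediate re-watch, no onError(throwable) propagation
    }

    /** Called whenever new TaskManager pods are requested. */
    void requestResource() {
        if (watchBroken.compareAndSet(true, false)) {
            // Re-establish the pod watch lazily, only when it is needed again.
            recreateWatch();
        }
        // ... create the pod ...
        // TaskManager failures that happened while the watch was down would be
        // detected via heartbeat timeouts instead of pod events.
    }

    private void recreateWatch() {
        // ... start a new pod watch via the Kubernetes client ...
    }
}
{code}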


