[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090578#comment-17090578
 ] 

Till Rohrmann commented on FLINK-17273:
---------------------------------------

I think part of the reason we forgot to call this function is that the 
{{ResourceManager}} does not enforce a particular control flow. I think it would 
be better if the {{ResourceManager}} offered calls such as 
{{notifyWorkerFailed}}, which would trigger failover behaviour controlled by 
the {{ResourceManager}} itself and not by the subclass. To make this work, I 
guess we should look at the overall architecture and think about which 
callbacks the {{ResourceManager}} would need in order to do its job. The 
{{ResourceManager}} should then be responsible for reacting to failures and 
other signals, and simply invoke the implementation-specific callbacks (e.g. 
terminating a pod). In contrast, our current {{ResourceManager}} 
implementations handle most of the logic themselves, which can lead to problems 
such as forgetting to call a method that the contract requires.
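
To illustrate the idea, here is a rough sketch of what such a control flow 
could look like. It is not the actual Flink code: the class names 
{{SketchResourceManager}} and {{SketchKubernetesResourceManager}}, the 
{{releaseWorker}} callback, and all signatures are made up for illustration. 
Only the names {{notifyWorkerFailed}} and {{closeTaskManagerConnection}} come 
from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: it owns the failover control flow.
abstract class SketchResourceManager {
    // Records closed connections so the sketch can be inspected.
    private final List<String> closedConnections = new ArrayList<>();

    // Final, so a subclass cannot skip or reorder the required steps.
    public final void notifyWorkerFailed(String workerId, String cause) {
        // Step 1: the base class always closes the connection itself.
        closeTaskManagerConnection(workerId, cause);
        // Step 2: only the deployment-specific cleanup is delegated.
        releaseWorker(workerId);
    }

    // Private, so the contract is enforced in exactly one place.
    private void closeTaskManagerConnection(String workerId, String cause) {
        closedConnections.add(workerId);
    }

    // Implementation-specific callback, e.g. terminating a pod.
    protected abstract void releaseWorker(String workerId);

    List<String> closedConnections() {
        return closedConnections;
    }
}

// Hypothetical subclass: it implements only the callback.
class SketchKubernetesResourceManager extends SketchResourceManager {
    final List<String> terminatedPods = new ArrayList<>();

    @Override
    protected void releaseWorker(String workerId) {
        // A real implementation would ask Kubernetes to delete the pod here.
        terminatedPods.add(workerId);
    }
}
```

With this shape, the code that detects a failed pod would only call 
{{notifyWorkerFailed}}, and there would be no separate step it could forget.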

> Fix not calling ResourceManager#closeTaskManagerConnection in 
> KubernetesResourceManager in case of registered TaskExecutor failure
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17273
>                 URL: https://issues.apache.org/jira/browse/FLINK-17273
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, Runtime / Coordination
>    Affects Versions: 1.10.0, 1.10.1
>            Reporter: Canbin Zheng
>            Assignee: Canbin Zheng
>            Priority: Major
>             Fix For: 1.11.0
>
>
> At the moment, the {{KubernetesResourceManager}} does not call 
> {{ResourceManager#closeTaskManagerConnection}} once it detects that a 
> currently registered task executor has failed. This ticket proposes to fix 
> this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)