[jira] [Updated] (FLINK-15836) Start a new pods watcher in KubernetesResourceManager when the old one is closed with exception

Yang Wang (Jira) Wed, 15 Apr 2020 21:51:08 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yang Wang updated FLINK-15836:
------------------------------
    Description: 
As discussed in the PR [1], if {{watchReconnectLimit}} is configured by users 
via Java properties or environment variables, the watch may be stopped and 
subsequent pod changes will not be processed. So we need to throw a fatal 
exception in {{KubernetesResourceManager}} when the pods watcher is closed with 
an exception.
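A minimal sketch of that behavior, for illustration only. The class 
{{PodsWatchCloseHandlerSketch}} and its {{fatalErrorHandler}} callback are 
hypothetical names, not the actual Flink or fabric8 API; the only assumption is 
that the resource manager has some callback that terminates the process on a 
fatal error.

```java
import java.util.function.Consumer;

/** Hypothetical handler; it only illustrates "escalate instead of retrying". */
final class PodsWatchCloseHandlerSketch {

    // Assumed callback that ends the jobmanager process on a fatal error.
    private final Consumer<Throwable> fatalErrorHandler;

    PodsWatchCloseHandlerSketch(Consumer<Throwable> fatalErrorHandler) {
        this.fatalErrorHandler = fatalErrorHandler;
    }

    /** Invoked when the pods watcher is closed; cause is null on a normal close. */
    void onClose(Exception cause) {
        if (cause == null) {
            return; // Closed on purpose, nothing to escalate.
        }
        // Do not start a new watcher here; report a fatal error instead.
        fatalErrorHandler.accept(
                new IllegalStateException(
                        "The pods watcher was closed with an exception.", cause));
    }
}
```

With this shape, a normal close is ignored and an exceptional close reaches 
the fatal error callback exactly once.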

 

> Why do we not retry in {{KubernetesResourceManager}} when the watcher is 
> closed exceptionally?

In the {{WatchConnectionManager}} implementation of the fabric8 Kubernetes 
client, if the web socket is closed exceptionally, it checks the 
{{reconnectLimit}} and schedules a reconnect if the limit allows it. When a 
reconnect succeeds, {{currentReconnectAttempt}} is reset to 0. So if users 
explicitly specify the {{reconnectLimit}}, we should respect it. The web socket 
is usually closed exceptionally because of network problems or port abuse. The 
correct way to handle this is to fail the jobmanager pod and retry in a new 
one.
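
The reconnect bookkeeping described above can be sketched roughly as follows. 
This is not the fabric8 source: {{ReconnectPolicySketch}} and its method names 
are hypothetical, and treating a negative limit as "no limit" is an assumption 
of this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the reconnect counting; not fabric8 code. */
final class ReconnectPolicySketch {

    // Assumption: a negative value means reconnects are not limited.
    private final int reconnectLimit;
    private final AtomicInteger currentReconnectAttempt = new AtomicInteger(0);

    ReconnectPolicySketch(int reconnectLimit) {
        this.reconnectLimit = reconnectLimit;
    }

    /** Returns true if a reconnect should be scheduled after an exceptional close. */
    boolean onExceptionalClose() {
        if (reconnectLimit >= 0 && currentReconnectAttempt.get() >= reconnectLimit) {
            return false; // Limit reached: the watch stays closed.
        }
        currentReconnectAttempt.incrementAndGet();
        return true;
    }

    /** A successful reconnect resets the attempt counter to 0. */
    void onReconnectSucceeded() {
        currentReconnectAttempt.set(0);
    }

    int attempts() {
        return currentReconnectAttempt.get();
    }
}
```

Because the counter resets after every successful reconnect, the limit only 
applies to consecutive failures. Once it is reached, retrying again inside 
{{KubernetesResourceManager}} would bypass a limit the user set explicitly.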

 

[1]. [https://github.com/apache/flink/pull/10965#discussion_r373491974]

  was:
As discussed in the PR [1], if {{watchReconnectLimit}} is configured by users 
via Java properties or environment variables, the watch may be stopped and 
subsequent pod changes will not be processed. So we need to start a new pods 
watcher in {{KubernetesResourceManager}} when the old one is closed with an 
exception.

[1]. [https://github.com/apache/flink/pull/10965#discussion_r373491974]


> Start a new pods watcher in KubernetesResourceManager when the old one is 
> closed with exception
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15836
>                 URL: https://issues.apache.org/jira/browse/FLINK-15836
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Deployment / Kubernetes
>            Reporter: Yang Wang
>            Assignee: Yang Wang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h



