[ https://issues.apache.org/jira/browse/FLINK-17127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Wang updated FLINK-17127:
------------------------------
    Description: 
Follow-up to the discussion in this PR [1].

In the current implementation, {{POD_CREATION_RETRY_INTERVAL}} is fixed at 
"3s", which means that when creating a TaskManager pod fails, we schedule a 
retry after a 3-second delay. This works for most cases. However, there is 
still a risk that too many retries from different Flink clusters will flood 
the Kubernetes API server. So we need to add initial and max settings for the 
retry interval, similar to 
{{NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX}}.

 

Inspired by FLINK-17176: when a pod crashes unexpectedly, we should also apply 
the retry interval to avoid flooding the K8s API server with requests.

 

We could add an {{ExponentialBackoff}} for the retry policy. The backoff could 
be reset to its initial value when a new TaskManager registers successfully, 
which indicates that creating and starting TaskManager pods works again after 
several retries.
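
The backoff described above could look roughly like the following minimal sketch. The class name {{PodCreationBackoff}}, its methods, and the doubling factor are assumptions made for illustration; they are not existing Flink APIs, and the 3s/60s values in the usage note are only examples.

```java
import java.time.Duration;

/** Hypothetical helper (not a Flink class): exponential backoff for pod creation retries. */
final class PodCreationBackoff {
    private final Duration initial; // assumed "initial" retry interval setting
    private final Duration max;     // assumed "max" retry interval setting
    private Duration next;          // delay to use for the next retry

    PodCreationBackoff(Duration initial, Duration max) {
        if (initial.isZero() || initial.isNegative() || max.compareTo(initial) < 0) {
            throw new IllegalArgumentException("require 0 < initial <= max");
        }
        this.initial = initial;
        this.max = max;
        this.next = initial;
    }

    /** Returns the delay before the next retry, then doubles it, capped at max. */
    Duration nextDelay() {
        Duration current = next;
        Duration doubled = next.multipliedBy(2);
        next = doubled.compareTo(max) > 0 ? max : doubled;
        return current;
    }

    /** Call when a new TaskManager registers successfully. */
    void reset() {
        next = initial;
    }
}
```

With an initial interval of 3s and a max of 60s, successive failed attempts would wait 3s, 6s, 12s, 24s, 48s, and then 60s for every further retry, until {{reset()}} is called.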

 

[1]. [https://github.com/apache/flink/pull/11427#discussion_r406318451]

  was:
Follow-up to the discussion in this PR [1].

In the current implementation, {{POD_CREATION_RETRY_INTERVAL}} is fixed at 
"3s", which means that when creating a TaskManager pod fails, we schedule a 
retry after a 3-second delay. This works for most cases. However, there is 
still a risk that too many retries from different Flink clusters will flood 
the Kubernetes API server. So we need to add initial and max settings for the 
retry interval, similar to 
{{NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX}}.

 

 

[1]. https://github.com/apache/flink/pull/11427#discussion_r406318451


> Make pod creating retry interval configurable
> ---------------------------------------------
>
>                 Key: FLINK-17127
>                 URL: https://issues.apache.org/jira/browse/FLINK-17127
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: Yang Wang
>            Priority: Major
>
> Follow-up to the discussion in this PR [1].
> In the current implementation, {{POD_CREATION_RETRY_INTERVAL}} is fixed at 
> "3s", which means that when creating a TaskManager pod fails, we schedule a 
> retry after a 3-second delay. This works for most cases. However, there is 
> still a risk that too many retries from different Flink clusters will flood 
> the Kubernetes API server. So we need to add initial and max settings for 
> the retry interval, similar to 
> {{NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX}}.
>  
> Inspired by FLINK-17176: when a pod crashes unexpectedly, we should also 
> apply the retry interval to avoid flooding the K8s API server with requests.
>  
> We could add an {{ExponentialBackoff}} for the retry policy. The backoff 
> could be reset to its initial value when a new TaskManager registers 
> successfully, which indicates that creating and starting TaskManager pods 
> works again after several retries.
>  
> [1]. [https://github.com/apache/flink/pull/11427#discussion_r406318451]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
