[ https://issues.apache.org/jira/browse/FLINK-17127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yang Wang updated FLINK-17127:
------------------------------
Description:
This follows the discussion in this PR [1].
In the current implementation, {{POD_CREATION_RETRY_INTERVAL}} is fixed at
"3s", which means that when creating a TaskManager pod fails, we schedule a
retry after a 3s delay. This works for most cases. However, there is still a
risk that too many retries from different Flink clusters will flood the
Kubernetes API server. So we need to add initial and max settings for the
retry interval, similar to
{{NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX}}.
We could add an {{ExponentialBackoff}} for the retry policy. The backoff could
be reset to its initial value when a new TaskManager is created successfully
after several retries.
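As a rough illustration of the proposed policy, here is a minimal sketch of
such a backoff. It is not Flink's actual implementation: the class shape, the
method names {{nextDelay}} and {{reset}}, and the doubling factor are
assumptions made for this example only. The delay starts at the initial
value, doubles after every failed attempt, is capped at the max value, and
returns to the initial value on {{reset}}.

```java
import java.time.Duration;

/**
 * Hypothetical sketch of an exponential backoff for pod creation retries.
 * The names and the doubling factor are assumptions, not Flink's API.
 */
final class ExponentialBackoff {
    private final Duration initial;
    private final Duration max;
    private Duration next;

    ExponentialBackoff(Duration initial, Duration max) {
        if (initial.isZero() || initial.isNegative() || max.compareTo(initial) < 0) {
            throw new IllegalArgumentException("require 0 < initial <= max");
        }
        this.initial = initial;
        this.max = max;
        this.next = initial;
    }

    /** Returns the delay for the upcoming retry, then doubles it up to max. */
    Duration nextDelay() {
        Duration current = next;
        // Compare against max / 2 so that doubling can never overshoot max.
        next = next.compareTo(max.dividedBy(2)) >= 0 ? max : next.multipliedBy(2);
        return current;
    }

    /** Call once a TaskManager pod has been created successfully. */
    void reset() {
        next = initial;
    }
}
```

With an initial interval of 3s (the current fixed value) and an assumed max of
60s, successive failures would wait 3s, 6s, 12s, 24s, 48s, and then 60s for
every further attempt until {{reset}} is called.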
Inspired by FLINK-17176: when a pod crashes unexpectedly, we should also
apply the retry interval to avoid flooding the K8s API server with requests.
But that could be done in a separate ticket/PR.
[1]. [https://github.com/apache/flink/pull/11427#discussion_r406318451]
was:
This follows the discussion in this PR [1].
In the current implementation, {{POD_CREATION_RETRY_INTERVAL}} is fixed at
"3s", which means that when creating a TaskManager pod fails, we schedule a
retry after a 3s delay. This works for most cases. However, there is still a
risk that too many retries from different Flink clusters will flood the
Kubernetes API server. So we need to add initial and max settings for the
retry interval, similar to
{{NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX}}.
Inspired by FLINK-17176: when a pod crashes unexpectedly, we should also
apply the retry interval to avoid flooding the K8s API server with requests.
We could add an {{ExponentialBackoff}} for the retry policy. The backoff could
be reset to its initial value when a new TaskManager registers successfully,
which means that creating and starting a TaskManager pod works again after
several retries.
[1]. [https://github.com/apache/flink/pull/11427#discussion_r406318451]
> Make pod creating retry interval configurable
> ---------------------------------------------
>
> Key: FLINK-17127
> URL: https://issues.apache.org/jira/browse/FLINK-17127
> Project: Flink
> Issue Type: New Feature
> Components: Deployment / Kubernetes
> Reporter: Yang Wang
> Priority: Major
>
> This follows the discussion in this PR [1].
> In the current implementation, {{POD_CREATION_RETRY_INTERVAL}} is fixed at
> "3s", which means that when creating a TaskManager pod fails, we schedule a
> retry after a 3s delay. This works for most cases. However, there is still a
> risk that too many retries from different Flink clusters will flood the
> Kubernetes API server. So we need to add initial and max settings for the
> retry interval, similar to
> {{NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX}}.
>
> We could add an {{ExponentialBackoff}} for the retry policy. The backoff
> could be reset to its initial value when a new TaskManager is created
> successfully after several retries.
>
> Inspired by FLINK-17176: when a pod crashes unexpectedly, we should also
> apply the retry interval to avoid flooding the K8s API server with requests.
> But that could be done in a separate ticket/PR.
>
> [1]. [https://github.com/apache/flink/pull/11427#discussion_r406318451]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)