[
https://issues.apache.org/jira/browse/FLINK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17908256#comment-17908256
]
Yun Tang commented on FLINK-36965:
----------------------------------
[~wangyang0918] [~xtsong] Please take a look at this ticket.
> Enable to allow re-create the pod watch with many retries on k8s cluster
> failure
> --------------------------------------------------------------------------------
>
> Key: FLINK-36965
> URL: https://issues.apache.org/jira/browse/FLINK-36965
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / Kubernetes
> Affects Versions: 1.20.0
> Reporter: Yun Tang
> Priority: Major
>
> FLINK-33728 introduced a backoff strategy for creating the pod watch.
> With it, we can set
> {{kubernetes.transactional-operation.max-retries}} to a very large value to
> tolerate long k8s cluster downtime. However, two problems remain:
> 1. Setting {{kubernetes.transactional-operation.max-retries}} to
> {{100}} or more, so that the JobMaster survives more than one hour of k8s
> cluster downtime without crashing, also makes
> {{FlinkKubeClient#checkAndUpdateConfigMap}} retry for much longer, which is
> not necessary.
> 2. Moreover, creating the pod watch is not a transactional operation, so
> applying the current config options
> {{kubernetes.transactional-operation.initial-retry-delay}} and
> {{kubernetes.transactional-operation.max-retry-delay}} to it is misleading.
> Thus, I think we should introduce a new
> {{kubernetes.watch-operation.max-retries}} option, together with
> {{kubernetes.watch-operation.initial-retry-delay}} and
> {{kubernetes.watch-operation.max-retry-delay}}, and deprecate the previous two
> options.
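
To illustrate why the retry count for the watch might be configured separately, here is a rough sketch of how the worst-case wait grows with the number of retries under a capped exponential backoff. This is not Flink's implementation: the `RetryPolicy` class, the doubling factor, and the numeric values (50 ms initial delay, 60 s cap, 100 and 5 retries) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings; not a Flink class."""
    max_retries: int
    initial_delay_s: float
    max_delay_s: float

    def delays(self):
        """Delay before each retry: doubles every attempt, capped at max_delay_s."""
        delay = self.initial_delay_s
        result = []
        for _ in range(self.max_retries):
            result.append(min(delay, self.max_delay_s))
            delay = min(delay * 2, self.max_delay_s)
        return result

    def worst_case_wait_s(self):
        """Total time spent waiting if every retry fails."""
        return sum(self.delays())


# Assumed values: a large budget for re-creating the pod watch,
# and a small, separate budget for other operations.
watch_policy = RetryPolicy(max_retries=100, initial_delay_s=0.05, max_delay_s=60.0)
other_policy = RetryPolicy(max_retries=5, initial_delay_s=0.05, max_delay_s=60.0)

if __name__ == "__main__":
    print(f"watch: {watch_policy.worst_case_wait_s() / 60:.1f} min of retrying")
    print(f"other: {other_policy.worst_case_wait_s():.2f} s of retrying")
```

Under these assumed values the watch policy keeps retrying for well over an hour, while the smaller policy gives up within a couple of seconds. Sharing one `max-retries` value between the two forces both onto the larger budget, which is the first problem described above.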