Yun Tang created FLINK-36965:
--------------------------------
Summary: Enable to allow re-create the pod watch with many retries
on k8s cluster failure
Key: FLINK-36965
URL: https://issues.apache.org/jira/browse/FLINK-36965
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes
Affects Versions: 1.20.0
Reporter: Yun Tang
FLINK-33728 introduced a backoff strategy for creating the pod watch. With it,
we can set {{kubernetes.transactional-operation.max-retries}} to a very large
value so that the JobMaster tolerates a long k8s cluster downtime.
However, two problems remain:
1. If we set {{kubernetes.transactional-operation.max-retries}} to {{100}} or
more, hoping that the JobMaster survives more than one hour of k8s cluster
downtime without crashing, the same setting also makes
{{FlinkKubeClient#checkAndUpdateConfigMap}} retry for much longer, which is
unnecessary.
2. Moreover, creating the pod watch is not a transactional operation, so
reusing the current config options
{{kubernetes.transactional-operation.initial-retry-delay}} and
{{kubernetes.transactional-operation.max-retry-delay}} for it is misleading.
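To make the "more than one hour" figure in item 1 concrete, here is a rough sketch of how the total wait grows with the retry count. It assumes a delay that doubles after each failed attempt and is capped at the maximum delay, with an initial delay of 50 ms and a maximum of 1 min. These values and the doubling factor are assumptions for illustration and should be checked against the Flink version in use; {{total_backoff_ms}} is a hypothetical helper, not a Flink API.

```python
def total_backoff_ms(initial_delay_ms: int, max_delay_ms: int, max_retries: int) -> int:
    """Sum the delays of a capped exponential backoff (doubling is an assumption)."""
    total = 0
    delay = initial_delay_ms
    for _ in range(max_retries):
        total += delay
        # Double the delay for the next attempt, but never exceed the cap.
        delay = min(delay * 2, max_delay_ms)
    return total


if __name__ == "__main__":
    # Assumed values: 50 ms initial delay, 1 min maximum delay, 100 retries.
    total = total_backoff_ms(50, 60_000, 100)
    print(f"{total} ms, about {total / 60_000:.1f} min")
```

Under these assumptions, 100 retries add up to roughly 90 minutes of waiting, and the same retry budget would apply to every operation that shares the option.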
Thus, I think we should introduce a new
{{kubernetes.watch-operation.max-retries}} option, together with
{{kubernetes.watch-operation.initial-retry-delay}} and
{{kubernetes.watch-operation.max-retry-delay}}, and deprecate the previous two
options.
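One possible way to handle the deprecation is sketched below: read the proposed watch-specific key first and fall back to the older transactional key only if the new one is unset. The fallback behavior, the {{resolve_option}} helper, and the plain-dict configuration are assumptions for illustration; the issue does not specify how the deprecated options would be honored.

```python
def resolve_option(conf: dict, key: str, deprecated_key: str, default):
    """Return conf[key], else conf[deprecated_key], else default (hypothetical helper)."""
    if key in conf:
        return conf[key]
    if deprecated_key in conf:
        # The old key is still honored so existing configurations keep working.
        return conf[deprecated_key]
    return default


if __name__ == "__main__":
    # A configuration that only sets the older, transactional-operation key.
    conf = {"kubernetes.transactional-operation.max-retry-delay": "2 min"}
    print(resolve_option(
        conf,
        "kubernetes.watch-operation.max-retry-delay",
        "kubernetes.transactional-operation.max-retry-delay",
        "1 min",  # assumed default, for illustration only
    ))
```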