[jira] [Commented] (FLINK-36965) Enable to allow re-create the pod watch with many retries on k8s cluster failure

Yun Tang (Jira) Wed, 25 Dec 2024 22:41:37 -0800


    [ https://issues.apache.org/jira/browse/FLINK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17908256#comment-17908256 ]


Yun Tang commented on FLINK-36965:
----------------------------------

[~wangyang0918] [~xtsong] Please take a look at this ticket.

> Enable to allow re-create the pod watch with many retries on k8s cluster 
> failure
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-36965
>                 URL: https://issues.apache.org/jira/browse/FLINK-36965
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.20.0
>            Reporter: Yun Tang
>            Priority: Major
>
> FLINK-33728 introduced a backoff strategy for creating the pod watch. 
> With it, we can set 
> {{kubernetes.transactional-operation.max-retries}} to a very large value to 
> tolerate long k8s cluster downtime. However, two problems remain:
> 1. Setting {{kubernetes.transactional-operation.max-retries}} to 
> {{100}} or more, so that the JobMaster survives more than one hour of k8s 
> cluster downtime without crashing, also makes 
> {{FlinkKubeClient#checkAndUpdateConfigMap}} retry for much longer, which is 
> not necessary.
> 2. Moreover, creating the pod watch is not a transactional operation, so 
> reusing the current config options 
> {{kubernetes.transactional-operation.initial-retry-delay}} and 
> {{kubernetes.transactional-operation.max-retry-delay}} for it is misleading.
> Thus, I think we should introduce a new option 
> {{kubernetes.watch-operation.max-retries}}, together with 
> {{kubernetes.watch-operation.initial-retry-delay}} and 
> {{kubernetes.watch-operation.max-retry-delay}}, and deprecate the previous two 
> options.
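
To make the "more than one hour" figure in point 1 concrete, here is a rough sketch of how a retry count translates into tolerated downtime under a capped exponential backoff. This is not Flink's implementation: the function name, the doubling multiplier, and the sample values (50 ms initial delay, 60 s maximum delay) are assumptions chosen only for illustration.

```python
def total_backoff_ms(max_retries, initial_delay_ms, max_delay_ms, multiplier=2):
    """Sum the waits of a capped exponential backoff over max_retries attempts.

    Hypothetical helper: assumes the delay is multiplied by `multiplier`
    after every attempt and never exceeds max_delay_ms.
    """
    total = 0
    delay = initial_delay_ms
    for _ in range(max_retries):
        total += min(delay, max_delay_ms)
        delay = min(delay * multiplier, max_delay_ms)
    return total


if __name__ == "__main__":
    # Assumed sample values, not confirmed defaults.
    waited = total_backoff_ms(max_retries=100, initial_delay_ms=50, max_delay_ms=60_000)
    print(f"{waited} ms, about {waited / 60_000:.1f} minutes")
```

Under these assumptions, 100 retries add up to roughly 90 minutes of waiting. Because the retry count is shared, the same budget would also apply to every other operation governed by {{kubernetes.transactional-operation.max-retries}}, which is the coupling the ticket proposes to remove.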



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
