[jira] [Commented] (FLINK-17176) Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler

Yang Wang (Jira) Thu, 16 Apr 2020 19:31:22 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-17176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085373#comment-17085373
 ]


Yang Wang commented on FLINK-17176:
-----------------------------------

[~felixzheng] I am afraid they are almost the same case. Whether creating a 
TaskManager pod fails, or an existing TaskManager pod exits abnormally or 
is deleted, we need to retry if necessary. Maybe they could share the same 
retry logic.
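
A minimal sketch of what such shared retry logic could look like, assuming an exponential backoff between creation requests. The class {{PodCreationBackoff}} and its methods are hypothetical names used for illustration only; they are not part of Flink or of {{KubernetesResourceManager}}.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not a Flink class): exponential backoff between
 * TaskManager pod creation requests, shared by every failure path.
 */
final class PodCreationBackoff {
    private final long initialMillis;
    private final long maxMillis;
    private long nextMillis;

    PodCreationBackoff(Duration initial, Duration max) {
        this.initialMillis = initial.toMillis();
        this.maxMillis = max.toMillis();
        if (initialMillis <= 0 || maxMillis < initialMillis) {
            throw new IllegalArgumentException("require 0 < initial <= max");
        }
        this.nextMillis = initialMillis;
    }

    /**
     * Call on any failure (creation failed, pod terminated before
     * registering, pod deleted). Returns the delay to wait before the next
     * creation request and doubles the following delay, capped at max.
     */
    synchronized Duration onFailure() {
        long delay = nextMillis;
        // Compare against max / 2 first so that doubling cannot overflow.
        nextMillis = (nextMillis > maxMillis / 2) ? maxMillis : nextMillis * 2;
        return Duration.ofMillis(delay);
    }

    /** Call once a TaskManager has registered; resets the delay. */
    synchronized void onSuccess() {
        nextMillis = initialMillis;
    }
}
```

Under this assumption, both the creation-failure path and the pod-terminated path would ask the same object for a delay and schedule the next creation request after it, instead of sending the request immediately.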

> Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler
> -------------------------------------------------------------------------
>
>                 Key: FLINK-17176
>                 URL: https://issues.apache.org/jira/browse/FLINK-17176
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0
>            Reporter: Canbin Zheng
>            Priority: Major
>             Fix For: 1.11.0
>
>
> In native K8s setups, there are cases in which we do not control the 
> speed of pod re-creation, which risks flooding the K8s API 
> Server from the {{PodCallbackHandler}} implementation of 
> {{KubernetesResourceManager}}.
> Here are the steps to reproduce this kind of problem:
>  # Mount {{/opt/flink/log}} in the TaskManager container to a path on 
> the K8s nodes via HostPath. Make sure that the path exists but the 
> TaskManager process has no write permission. We can achieve this via the 
> [user-specified pod template 
> support|https://issues.apache.org/jira/browse/FLINK-15656] or just hardcode 
> it for testing only.
>  # Launch a session cluster.
>  # Submit a new job to the session cluster. As expected, we can observe that 
> the Pod repeatedly fails quickly while launching the main container; 
> {{KubernetesResourceManager#onModified}} is then invoked to re-create a new 
> Pod immediately, without any speed control.
> To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD* 
> event and that Pod is terminated before successfully registering with the 
> {{KubernetesResourceManager}}, the {{KubernetesResourceManager}} will send 
> another creation request to the K8s API Server immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
