[
https://issues.apache.org/jira/browse/FLINK-17176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084629#comment-17084629
]
Canbin Zheng commented on FLINK-17176:
--------------------------------------
Thanks for the timely feedback, [Yang
Wang|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=fly_in_gis]!
It seems that FLINK-17127 is for the configurable
retry interval support and this ticket is for delaying the pod re-creation in
{{KubernetesResourceManager#onModified}} and
{{KubernetesResourceManager#onDeleted}}. WDYT?
> Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler
> -------------------------------------------------------------------------
>
> Key: FLINK-17176
> URL: https://issues.apache.org/jira/browse/FLINK-17176
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / Kubernetes
> Affects Versions: 1.10.0
> Reporter: Canbin Zheng
> Priority: Major
> Fix For: 1.11.0
>
>
> In native K8s setups, there are cases in which we do not control the
> speed of pod re-creation, which risks flooding the K8s API
> Server from the {{PodCallbackHandler}} implementation of
> {{KubernetesResourceManager}}.
> Here are the steps to reproduce this kind of problem:
> # Mount {{/opt/flink/log}} in the TaskManager container to a path on
> the K8s nodes via HostPath; make sure that the path exists but the
> TaskManager process has no write permission. We can achieve this via the
> user-specified pod template support or just hardcode it for testing only.
> # Launch a session cluster.
> # Submit a new job to the session cluster. As expected, we can observe that
> the Pod repeatedly fails quickly while launching the main container, and
> {{KubernetesResourceManager#onModified}} is then invoked to re-create a new
> Pod immediately, without any speed control.
> To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD*
> event and that Pod is terminated before successfully registering with the
> {{KubernetesResourceManager}}, the {{KubernetesResourceManager}} sends
> another creation request to the K8s API Server immediately.
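
To make the "speed control" idea concrete, here is a minimal sketch of one possible approach: an exponential backoff, capped at a maximum, applied before each re-creation request. It is not the actual Flink implementation. The class {{PodRecreationBackoff}}, its method names, and the delay values are all hypothetical and chosen only for illustration.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Flink): counts consecutive pods that
 * terminated before their TaskManager registered, and computes how long
 * to wait before sending the next pod creation request.
 */
final class PodRecreationBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private int consecutiveFailures = 0;

    PodRecreationBackoff(Duration initialDelay, Duration maxDelay) {
        if (initialDelay.toMillis() <= 0) {
            throw new IllegalArgumentException("initialDelay must be at least 1 ms");
        }
        if (maxDelay.compareTo(initialDelay) < 0) {
            throw new IllegalArgumentException("maxDelay must be >= initialDelay");
        }
        this.initialDelayMs = initialDelay.toMillis();
        this.maxDelayMs = maxDelay.toMillis();
    }

    /**
     * Call when a pod terminates before registering. Returns the delay to
     * apply before the next creation request: initialDelay doubled once per
     * previous consecutive failure, capped at maxDelay.
     */
    Duration onPodTerminatedBeforeRegistration() {
        long delayMs = initialDelayMs;
        for (int i = 0; i < consecutiveFailures && delayMs < maxDelayMs; i++) {
            // Double without overflowing, never exceeding the cap.
            delayMs = (delayMs > maxDelayMs / 2) ? maxDelayMs : delayMs * 2;
        }
        consecutiveFailures++;
        return Duration.ofMillis(delayMs);
    }

    /** Call when a TaskManager registers successfully; the backoff starts over. */
    void onTaskManagerRegistered() {
        consecutiveFailures = 0;
    }
}
```

A caller could pass the returned delay to something like {{ScheduledExecutorService#schedule}} instead of issuing the creation request directly from the callback. Whether the counter should be kept per pod, per job, or globally is left open here.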
--
This message was sent by Atlassian Jira
(v8.3.4#803005)