[jira] [Commented] (FLINK-15843) Do not violently kill TaskManagers on Kubernetes

Till Rohrmann (Jira) Wed, 05 Feb 2020 08:34:28 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-15843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030786#comment-17030786
 ]


Till Rohrmann commented on FLINK-15843:
---------------------------------------

Ah ok, I see. I guess it would not hurt to have a graceful shutdown. In 
particular, it seems to be relevant if you configured local recovery to use a 
persistent volume. But I don't think that this has a super high priority right 
now.

> Do not violently kill TaskManagers on Kubernetes
> ------------------------------------------------
>
>                 Key: FLINK-15843
>                 URL: https://issues.apache.org/jira/browse/FLINK-15843
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0
>            Reporter: Canbin Zheng
>            Priority: Major
>             Fix For: 1.11.0
>
>
> Currently, when the JobManager sends a deletion request, a TaskManager 
> instance is stopped by directly calling 
> {{KubernetesClient.pods().withName().delete}}. The instance is therefore 
> violently killed with a _KILL_ signal and has no chance to clean up, which 
> could cause problems because we expect the process to terminate gracefully 
> when it is no longer needed.
> According to the guide on [Termination of Pods|#termination-of-pods], on 
> Kubernetes a _TERM_ signal is first sent to the main process in each 
> container, and may be followed by a forced _KILL_ signal once the graceful 
> shutdown period has expired. The Unix signal is sent to the process with 
> PID 1 ([Docker 
> Kill|https://docs.docker.com/engine/reference/commandline/kill/]). However, 
> the TaskManagerRunner process is spawned by 
> /opt/flink/bin/kubernetes-entry.sh and can never have PID 1, so it never 
> receives the Unix signal.
>  
> One workaround could be that the JobManager first sends a *KILL_WORKER* 
> message to the TaskManager; the TaskManager then gracefully terminates itself 
> to ensure that the clean-up is fully completed; finally, the JobManager 
> deletes the Pod after a configurable graceful shutdown period.
>  
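
The workaround proposed in the quoted description can be sketched as a small, self-contained simulation. This is not Flink code: the TaskManager and JobManager classes, the stop_worker and delete_pod methods, and the parameter names are hypothetical stand-ins, and only the KILL_WORKER message name comes from the issue text. The sketch also assumes that the JobManager may delete the Pod as soon as clean-up is confirmed, rather than always waiting for the full grace period.

```python
import threading
import time

KILL_WORKER = "KILL_WORKER"  # message name taken from the issue text


class TaskManager:
    """Hypothetical stand-in for a TaskManager process."""

    def __init__(self, cleanup_seconds):
        self.cleanup_seconds = cleanup_seconds
        self.cleaned_up = threading.Event()

    def receive(self, message):
        # React to the shutdown request on a background thread,
        # so the sender is not blocked while clean-up runs.
        if message == KILL_WORKER:
            threading.Thread(target=self._shutdown, daemon=True).start()

    def _shutdown(self):
        time.sleep(self.cleanup_seconds)  # simulated clean-up work
        self.cleaned_up.set()


class JobManager:
    """Hypothetical stand-in for the JobManager side of the protocol."""

    def __init__(self, grace_period_seconds):
        self.grace_period_seconds = grace_period_seconds
        self.deleted_pods = []

    def stop_worker(self, pod_name, task_manager):
        # 1. Ask the TaskManager to terminate itself.
        task_manager.receive(KILL_WORKER)
        # 2. Wait for clean-up, but no longer than the grace period.
        graceful = task_manager.cleaned_up.wait(self.grace_period_seconds)
        # 3. Delete the Pod either way; report whether clean-up finished.
        self.delete_pod(pod_name)
        return graceful

    def delete_pod(self, pod_name):
        # Placeholder for the real Pod deletion call (assumption).
        self.deleted_pods.append(pod_name)


if __name__ == "__main__":
    jm = JobManager(grace_period_seconds=1.0)
    # Clean-up shorter than the grace period.
    print(jm.stop_worker("taskmanager-1", TaskManager(cleanup_seconds=0.05)))
    # Clean-up longer than the grace period.
    print(jm.stop_worker("taskmanager-2", TaskManager(cleanup_seconds=5.0)))
```

In this sketch the return value of stop_worker tells the caller whether the TaskManager confirmed its clean-up before the grace period ran out. The Pod is deleted in both cases, so a hanging TaskManager cannot block resource release indefinitely.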



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
