[ 
https://issues.apache.org/jira/browse/FLINK-15843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Canbin Zheng updated FLINK-15843:
---------------------------------
    Description: 
The current way of stopping a TaskManager instance when the JobManager issues a 
deletion request is to directly call 
{{KubernetesClient.pods().withName().delete}}. As a result, that instance is 
violently killed with a _KILL_ signal and has no chance to clean up, which can 
cause problems because we expect the process to terminate gracefully when it is 
no longer needed.
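
For reference, a minimal sketch of such a direct deletion with the fabric8 
client is shown below; the namespace and pod name are illustrative and this is 
not Flink's actual code.

{code:java}
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class AbruptPodDeletion {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Deleting the Pod directly: Kubernetes sends TERM to PID 1 of each
            // container and force-kills everything once the grace period expires.
            client.pods()
                  .inNamespace("default")              // illustrative namespace
                  .withName("flink-taskmanager-1-1")   // illustrative pod name
                  .delete();
        }
    }
}
{code}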

According to the [Termination of Pods|#termination-of-pods] guide, Kubernetes 
first sends a _TERM_ signal to the main process in each container and may 
follow it with a force _KILL_ signal once the grace period has expired. The 
Unix signal is delivered to the process with PID 1 
([Docker Kill|https://docs.docker.com/engine/reference/commandline/kill/]). 
However, the TaskManagerRunner process is spawned by 
{{/opt/flink/bin/kubernetes-entry.sh}} and can never have PID 1, so it never 
receives the signal.
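
To illustrate why receiving _TERM_ matters, here is a small, generic Java 
sketch (not Flink code): a JVM shutdown hook runs when the JVM itself receives 
_TERM_, but is skipped entirely on _KILL_, and it also never runs if the signal 
only reaches an entrypoint shell that does not forward it to the JVM.

{code:java}
public class ShutdownHookDemo {
    public static void main(String[] args) throws InterruptedException {
        // Runs on normal JVM shutdown, including SIGTERM delivered to this process.
        // It is skipped if the process is killed with SIGKILL, or if SIGTERM only
        // reaches a parent shell that does not forward it to the JVM.
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.out.println("Cleaning up before exit...")));

        Thread.currentThread().join(); // keep the process alive until a signal arrives
    }
}
{code}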

 

One workaround could be that the JobManager first sends a *KILL_WORKER* message 
to the TaskManager, the TaskManager then gracefully terminates itself to ensure 
that the clean-up is fully finished, and finally the JobManager deletes the Pod 
after a configurable graceful shutdown period. A rough sketch of this flow is 
given below.
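
In the sketch, hypothetical handles stand in for the RPC gateway and the 
Kubernetes client; the names {{TaskManagerHandle}}, {{PodDeleter}} and 
{{requestGracefulShutdown}} are made up for illustration and are not existing 
Flink APIs.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class GracefulWorkerShutdownSketch {

    /** Hypothetical handle for asking a TaskManager to shut itself down. */
    interface TaskManagerHandle {
        CompletableFuture<Void> requestGracefulShutdown();
    }

    /** Hypothetical handle for deleting the worker's Pod via the Kubernetes client. */
    interface PodDeleter {
        void deletePod(String podName);
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void stopWorker(TaskManagerHandle taskManager, PodDeleter podDeleter,
                    String podName, long gracePeriodSeconds) {
        // 1. Ask the TaskManager to terminate itself so it can finish its clean-up.
        taskManager.requestGracefulShutdown();

        // 2. Delete the Pod only after the configurable grace period, as a safety
        //    net in case the process hangs and never exits on its own.
        scheduler.schedule(() -> podDeleter.deletePod(podName),
                gracePeriodSeconds, TimeUnit.SECONDS);
    }
}
{code}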

 

 

 

> Do not violently kill TaskManagers
> ----------------------------------
>
>                 Key: FLINK-15843
>                 URL: https://issues.apache.org/jira/browse/FLINK-15843
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0
>            Reporter: Canbin Zheng
>            Priority: Major
>             Fix For: 1.11.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
