Canbin Zheng created FLINK-15843:
------------------------------------
Summary: Do not violently kill TaskManagers
Key: FLINK-15843
URL: https://issues.apache.org/jira/browse/FLINK-15843
Project: Flink
Issue Type: Sub-task
Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
Fix For: 1.11.0
The current way of stopping a TaskManager instance when the JobManager sends a
deletion request is to directly call
{{KubernetesClient.pods().withName().delete()}}, so that instance is killed
abruptly with a _KILL_ signal and has no chance to clean up. This can cause
problems, because we expect the process to terminate gracefully once
it is no longer needed.
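For illustration, here is a minimal sketch of the abrupt deletion versus a grace-period
deletion, assuming the fabric8 Kubernetes client; the namespace, pod name, and grace
period below are placeholders:

{code:java}
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PodDeletionSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Current behaviour: delete immediately, the container is torn down
            // without giving the TaskManager a chance to clean up.
            client.pods()
                  .inNamespace("default")                  // placeholder namespace
                  .withName("flink-taskmanager-1-1")       // placeholder pod name
                  .delete();

            // Gentler variant: honour a termination grace period so Kubernetes
            // sends TERM first and only escalates to KILL after the period expires.
            client.pods()
                  .inNamespace("default")
                  .withName("flink-taskmanager-1-1")
                  .withGracePeriod(30L)                    // seconds, placeholder value
                  .delete();
        }
    }
}
{code}

Note that even the grace-period variant only helps if the TERM signal actually reaches
the TaskManager process, which the next paragraph explains is not the case today.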
Referring to the guide on [Termination of
Pods|https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods],
on Kubernetes a _TERM_ signal is first sent to the main process in each container
and may be followed by a force _KILL_ signal once the grace period has expired.
The Unix signal is delivered to the process with PID 1 ([Docker
Kill|https://docs.docker.com/engine/reference/commandline/kill/]); however, the
TaskManagerRunner process is spawned by /opt/flink/bin/kubernetes-entry.sh and
can never have PID 1, so it never receives the signal.
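For context, a minimal sketch of why signal delivery matters; this is not Flink code,
just an illustration that a JVM receiving _TERM_ would run its shutdown hooks and could
clean up there, whereas a JVM that never sees the signal cannot:

{code:java}
public class ShutdownHookSketch {
    public static void main(String[] args) throws InterruptedException {
        // The JVM runs shutdown hooks when it receives SIGTERM. Clean-up placed
        // here would execute on a graceful pod termination. Because the
        // TaskManagerRunner JVM is not PID 1 inside the container, the TERM
        // signal sent by Kubernetes never reaches it and this hook never fires.
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println("releasing slots, flushing state, deregistering ...")));

        // Simulate a long-running TaskManager process.
        Thread.currentThread().join();
    }
}
{code}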
One workaround could be for the JobManager to first send a *KILL_WORKER* message
to the TaskManager, let the TaskManager terminate itself gracefully so that
clean-up finishes completely, and finally have the JobManager delete the Pod
after a configurable graceful shutdown period.
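A rough sketch of that sequence follows; it is purely illustrative, and the gateway,
message name, and deleter below are hypothetical rather than existing Flink APIs:

{code:java}
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Purely illustrative; none of these interfaces exist in Flink as-is.
public class GracefulStopSketch {

    interface TaskManagerGateway {
        // Hypothetical RPC asking the TaskManager to clean up and exit on its own.
        CompletableFuture<Void> killWorker();
    }

    interface PodDeleter {
        // Wraps the actual Kubernetes pod deletion call.
        void deletePod(String podName);
    }

    static void stopWorker(TaskManagerGateway gateway, PodDeleter deleter,
                           String podName, Duration gracePeriod) {
        gateway.killWorker()                                               // 1. request graceful shutdown
               .orTimeout(gracePeriod.toMillis(), TimeUnit.MILLISECONDS)   // 2. wait at most the grace period
               .whenComplete((ok, err) -> deleter.deletePod(podName));     // 3. then delete the Pod either way
    }
}
{code}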
--
This message was sent by Atlassian Jira
(v8.3.4#803005)