Paul Bramhall created AIRFLOW-4143:
--------------------------------------
Summary: Airflow unaware of worker pods being terminated or failing before the
Local Executor can update the task state.
Key: AIRFLOW-4143
URL: https://issues.apache.org/jira/browse/AIRFLOW-4143
Project: Apache Airflow
Issue Type: Bug
Components: kubernetes
Affects Versions: 1.10.2
Environment: Kubernetes, Centos
Reporter: Paul Bramhall
Whenever a worker pod is terminated within Kubernetes in a way that causes the
LocalExecutor running on that pod to exit before it can update the task state
within Airflow, Airflow is unaware and blissfully keeps the task in a running
state. This can prevent future runs of the DAG from triggering, or the DAG from
retrying upon failure.
When Airflow is restarted, the task state is still marked as running and is not
updated.
There is a KubernetesJobWatcher process within the Kubernetes Executor, but it
has no impact on this condition. It receives a Pod event of type DELETED and
still reports the task state as 'Running':
{code:java}
[2019-03-20 15:37:54,764] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type MODIFIED
[2019-03-20 15:37:54,764] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
[2019-03-20 15:37:54,767] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type DELETED
[2019-03-20 15:37:54,767] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
{code}
^^ This event sequence occurs when the pod is killed before the Local Executor
running on that worker can update the state. Airflow continues to show the task
as Running.
Ideally we need a check within the Kubernetes Executor to perform the following
actions:
Upon Executor Startup:
# Loop through all running DAGs and Task Instances.
# Check for the existence of worker pods associated with these instances.
# If no pod is found yet the task is shown as 'Running', perform a graceful
failure using the 'handle_failure()' function within Airflow itself (see the
sketch below).
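Something along these lines could cover the startup check. This is a rough
sketch only, not existing Airflow code: the 'airflow' namespace is a
placeholder, the 'airflow-worker' / 'dag_id' / 'task_id' pod labels are assumed
from what the 1.10.x KubernetesExecutor applies to worker pods, and
execution_date matching is skipped for brevity:
{code:python}
# Sketch: on executor startup, fail any task instance marked 'running' that has
# no live worker pod behind it. Assumptions (see lead-in): namespace and pod
# label names must match the actual deployment; execution_date is ignored.
from airflow.models import DagBag, TaskInstance
from airflow.utils.db import provide_session
from airflow.utils.state import State
from kubernetes import client, config


@provide_session
def fail_orphaned_task_instances(namespace='airflow', session=None):
    config.load_incluster_config()  # use load_kube_config() when run outside the cluster
    kube = client.CoreV1Api()

    # Task instances Airflow still believes are running.
    running_tis = (session.query(TaskInstance)
                   .filter(TaskInstance.state == State.RUNNING)
                   .all())

    # Worker pods that actually exist right now, keyed by their dag/task labels.
    pods = kube.list_namespaced_pod(namespace, label_selector='airflow-worker').items
    live = {(p.metadata.labels.get('dag_id'), p.metadata.labels.get('task_id'))
            for p in pods}

    dagbag = DagBag()
    for ti in running_tis:
        if (ti.dag_id, ti.task_id) not in live:
            # No pod backs this "running" task: fail it gracefully so retries,
            # callbacks and downstream scheduling behave as on a normal failure.
            ti.task = dagbag.get_dag(ti.dag_id).get_task(ti.task_id)
            ti.handle_failure('Worker pod disappeared before reporting task state',
                              session=session)
{code}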
While the executor is running, we can perform the same (or a similar) check
from within the KubernetesJobWatcher class when the following condition is met:
* the k8s Pod Event type is 'DELETED' _and_ the Task State is still set to 'Running'
This way, if a pod does unexpectedly terminate (for example, because it exceeds
its predefined resource limits and is OOM-killed), the DAG run should be marked
as failed accordingly.
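The watcher-side check could look roughly like the following. Again this is
only an illustrative sketch, not the existing KubernetesJobWatcher code: the
real watcher hands results back through a queue rather than writing to the
metadata DB directly, and the 'dag_id'/'task_id' pod labels are assumptions
based on what the 1.10.x KubernetesExecutor sets on worker pods:
{code:python}
# Sketch: react to a DELETED pod event whose task instance is still 'Running'.
# Pod label names and the direct DB update are assumptions (see lead-in).
from airflow.models import DagBag, TaskInstance
from airflow.utils.db import provide_session
from airflow.utils.state import State


@provide_session
def handle_pod_event(event, session=None):
    # 'event' is an item yielded by kubernetes.watch.Watch().stream(...)
    if event['type'] != 'DELETED':
        return

    pod = event['object']
    labels = pod.metadata.labels or {}

    ti = (session.query(TaskInstance)
          .filter(TaskInstance.dag_id == labels.get('dag_id'),
                  TaskInstance.task_id == labels.get('task_id'),
                  TaskInstance.state == State.RUNNING)
          .first())

    if ti is not None:
        # The pod was deleted (evicted, OOM-killed, node lost, ...) but the task
        # never reported back, so mark it failed and let retries/alerting kick in.
        ti.task = DagBag().get_dag(ti.dag_id).get_task(ti.task_id)
        ti.handle_failure('Worker pod {} was deleted while its task was still '
                          'running'.format(pod.metadata.name), session=session)
{code}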