Paul Bramhall created AIRFLOW-4143:
--------------------------------------

             Summary: Airflow unaware of worker pods being terminated or failing 
before the LocalExecutor can update the task state.
                 Key: AIRFLOW-4143
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4143
             Project: Apache Airflow
          Issue Type: Bug
          Components: kubernetes
    Affects Versions: 1.10.2
         Environment: Kubernetes, Centos
            Reporter: Paul Bramhall


Whenever a worker pod is terminated within Kubernetes in a way that causes the 
LocalExecutor running on that pod to exit before it can update the task state, 
Airflow is unaware and blissfully keeps the task in a running state.  This can 
prevent future runs of the DAG from triggering, or the DAG from retrying upon 
failure.

When Airflow is restarted, the task is still marked as running and its state is 
not updated.

There is a KubernetesJobWatcher process within the Kubernetes Executor, but it 
does not handle this condition.  It receives a pod event of type DELETED and 
still reports the task state as 'Running':


{code:java}
[2019-03-20 15:37:54,764] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type MODIFIED
[2019-03-20 15:37:54,764] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
[2019-03-20 15:37:54,767] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type DELETED
[2019-03-20 15:37:54,767] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
{code}
^^ These events are logged when the pod is killed before the LocalExecutor 
running on that worker can update the task state.  Airflow continues to show the 
task as Running.

Ideally we need a check within the Kubernetes Executor to perform the following 
actions:

Upon Executor Startup:
 # Loop through all running DAGs and Task Instances.
 # Check for the existence of worker pods associated with these instances.
 # If no pod is found, yet the task is shown as 'Running', perform a graceful 
failure using the 'handle_failure()' function within Airflow itself (a rough 
sketch follows this list).
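
A rough sketch of what this startup reconciliation could look like (illustrative only, not the executor's actual code; the _expected_pod_name() helper is hypothetical, standing in for however the executor derives worker pod names from the task instance):

{code:python}
# Sketch only: fail RUNNING task instances whose worker pod no longer exists.
from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State
from kubernetes import client, config


def _expected_pod_name(ti):
    # Hypothetical helper: the real executor derives the worker pod name
    # from dag_id / task_id / execution_date when it launches the pod.
    raise NotImplementedError


def fail_orphaned_task_instances(namespace):
    """On executor startup, reconcile RUNNING tasks against existing worker pods."""
    config.load_incluster_config()  # use load_kube_config() when running outside the cluster
    existing_pods = {
        pod.metadata.name
        for pod in client.CoreV1Api().list_namespaced_pod(namespace).items
    }

    session = settings.Session()
    running = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.RUNNING)
        .all()
    )
    for ti in running:
        if _expected_pod_name(ti) not in existing_pods:
            # handle_failure() drives the normal failure path (retries, callbacks)
            ti.handle_failure("Worker pod no longer exists", session=session)
    session.commit()
{code}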

While the executor is running, we can perform the same, or a similar, check from 
within the KubernetesJobWatcher class when the following condition is met 
(sketched below the condition):
 * k8s Pod Event type is 'DELETED' _and_ Task State is still set as 'Running'
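
A minimal sketch of that watcher-side check (again illustrative, not the actual KubernetesJobWatcher implementation; the surrounding watcher plumbing is omitted):

{code:python}
# Sketch only: fail the task if its pod was DELETED while Airflow still thinks it is running.
from airflow.utils.state import State


def handle_pod_event(event_type, ti, session):
    """Called from the watcher loop with the k8s event type and the matching TaskInstance."""
    if event_type == 'DELETED' and ti.state == State.RUNNING:
        ti.handle_failure(
            "Worker pod was deleted before the task reported a terminal state",
            session=session,
        )
        session.commit()
{code}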

This way, if a pod does terminate unexpectedly (for example, because it exceeds 
its predefined resource limits and is OOM-killed), the DAG run should be marked 
as failed accordingly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)