[
https://issues.apache.org/jira/browse/AIRFLOW-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Bramhall reassigned AIRFLOW-4143:
--------------------------------------
Assignee: Paul Bramhall
> Airflow unaware of worker pods being terminated or failing without being able
> for Local Executor service updating the task state.
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-4143
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4143
> Project: Apache Airflow
> Issue Type: Bug
> Components: kubernetes
> Affects Versions: 1.10.2
> Environment: Kubernetes, Centos
> Reporter: Paul Bramhall
> Assignee: Paul Bramhall
> Priority: Major
>
> Whenever a worker pod is terminated within Kubernetes, in a way that causes
> the LocalExecutor running on that pod to exit without being able to update
> the task state within airflow, AIrflow is unaware and blissfully keeps the
> task in a running state. This can prevent future runs of the dag from
> triggering, or the dag from retrying upon failure.
> When Airflow is restarted, the task state is still marked as running & is not
> updated.
> There is a JobWatcher process within the Kubernetes Executor, but this has no
> impact upon this condition. It will receive a Pod Event of DELETED, and
> still display the task state as 'Running':
> {code:java}
> [2019-03-20 15:37:54,764] {{kubernetes_executor.py:296}} INFO - Event:
> kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event
> of type MODIFIED [2019-03-20 15:37:54,764] {{kubernetes_executor.py:336}}
> INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484
> is Running [2019-03-20 15:37:54,767] {{kubernetes_executor.py:296}} INFO -
> Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an
> event of type DELETED [2019-03-20 15:37:54,767]
> {{kubernetes_executor.py:336}} INFO - Event:
> kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
> {code}
> ^^ This event is if the pod is killed without the Local Executor running on
> that worker updating the state. Airflow continues to show the task as
> Running.
> Ideally we need a check within the Kubernetes Executor to perform the
> following actions:
> Upon Executor Startup:
> # Loop through all running Dags and Task Instances
> # Check for the existence of worker pods associated to these instances.
> # If no pod is found, yet the task is shown as 'Running', perform a graceful
> failure using the 'handle_failure()' function within Airflow itself.
> During the executor running, we can perform the same, if not a similar check
> from within the KubernetesJobWatcher class when the following conditions are
> met:
> * k8s Pod Event type is 'DELETED' _and_ Task State is still set as 'Running'
> This way, if a pod does unexpectedly terminate (for example, it exceeds the
> predefined resource limits and oom's), then the dag should be marked as
> failed accordingly.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)