[ 
https://issues.apache.org/jira/browse/AIRFLOW-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Bramhall reassigned AIRFLOW-4143:
--------------------------------------

    Assignee: Paul Bramhall

> Airflow unaware of worker pods being terminated or failing before the Local 
> Executor service can update the task state.
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4143
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4143
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: kubernetes
>    Affects Versions: 1.10.2
>         Environment: Kubernetes, Centos
>            Reporter: Paul Bramhall
>            Assignee: Paul Bramhall
>            Priority: Major
>
> Whenever a worker pod is terminated within Kubernetes in a way that causes 
> the LocalExecutor running on that pod to exit before it can update the task 
> state, Airflow is unaware and blissfully keeps the task in a running state.  
> This can prevent future runs of the dag from triggering, or the dag from 
> retrying upon failure.
> Even when Airflow is restarted, the task state is still marked as running and 
> is not updated.
> There is a JobWatcher process within the Kubernetes Executor, but this has no 
> impact upon this condition.  It will receive a Pod Event of DELETED, and 
> still display the task state as 'Running':
> {code:java}
> [2019-03-20 15:37:54,764] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type MODIFIED
> [2019-03-20 15:37:54,764] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
> [2019-03-20 15:37:54,767] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type DELETED
> [2019-03-20 15:37:54,767] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
> {code}
> ^^ These events occur when the pod is killed without the Local Executor on 
> that worker updating the state.  Airflow continues to show the task as 
> Running.
> Ideally we need a check within the Kubernetes Executor to perform the 
> following actions:
> Upon Executor Startup:
>  # Loop through all running Dags and Task Instances
>  # Check for the existence of worker pods associated to these instances.
>  # If no pod is found, yet the task is shown as 'Running', perform a graceful 
> failure using the 'handle_failure()' function within Airflow itself.
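> The startup reconciliation above could be sketched roughly as follows. All 
> names here are hypothetical, not the actual executor API; a real 
> implementation would query the Airflow metadata database for running task 
> instances and the Kubernetes API for live worker pods:

```python
# Hypothetical sketch of the proposed startup check (names are illustrative,
# not Airflow's real API). Step 1: iterate running tasks; step 2: check for
# their worker pods; step 3: gracefully fail any task whose pod is gone.

def reconcile_on_startup(running_tasks, live_pod_names, handle_failure):
    """Fail any task marked 'running' whose worker pod no longer exists.

    running_tasks:   dict mapping task_id -> expected worker pod name
    live_pod_names:  set of pod names currently reported by Kubernetes
    handle_failure:  callback invoked for each orphaned task (stand-in for
                     Airflow's own handle_failure() on the task instance)
    """
    orphaned = []
    for task_id, pod_name in running_tasks.items():
        if pod_name not in live_pod_names:
            handle_failure(task_id)  # graceful failure instead of a stuck task
            orphaned.append(task_id)
    return orphaned
```

> For example, with tasks t1/t2 mapped to pods pod-a/pod-b but only pod-a 
> alive, t2 would be failed on startup.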
> While the executor is running, we can perform the same (or a similar) check 
> from within the KubernetesJobWatcher class when the following conditions are 
> met:
>  * k8s Pod Event type is 'DELETED' _and_ Task State is still set as 'Running'
> This way, if a pod does unexpectedly terminate (for example, because it 
> exceeds its predefined resource limits and is OOM-killed), the dag will be 
> marked as failed accordingly.
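> As an illustration, the watcher-side condition might look something like 
> this (function and parameter names are hypothetical, not the real 
> KubernetesJobWatcher interface):

```python
# Hypothetical sketch of the proposed watcher-side check: a DELETED pod event
# for a task that is still 'running' means the worker died before it could
# report its state, so the task should be failed rather than left hanging.

def process_pod_event(event_type, task_state, fail_task):
    """Return the new task state after a pod event.

    fail_task is a stand-in callback for the graceful-failure path
    (e.g. delegating to handle_failure() within Airflow itself).
    """
    if event_type == "DELETED" and task_state == "running":
        fail_task()
        return "failed"
    return task_state
```

> A MODIFIED event, or a DELETED event for a task that already reported a 
> terminal state, would leave the state untouched.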



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)