Shantanu created AIRFLOW-5102:
---------------------------------
Summary: Workers fail to shutdown jobs after failed heartbeats
Key: AIRFLOW-5102
URL: https://issues.apache.org/jira/browse/AIRFLOW-5102
Project: Apache Airflow
Issue Type: Bug
Components: worker
Affects Versions: 1.10.3
Reporter: Shantanu
Assignee: Shantanu
If a LocalTaskJob fails to heartbeat for longer than
scheduler_zombie_task_threshold, it should shut itself down:
[https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/local_task_job.py#L109]
However, at some point a change was made to catch exceptions inside the
heartbeat itself:
[https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/base_job.py#L194]
Because the exception never reaches it, LocalTaskJob now treats every
heartbeat as successful.
As a result, zombie tasks no longer shut themselves down. When the scheduler
reschedules the task, two instances of it can end up running concurrently.
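
The failure mode can be illustrated with a small standalone simulation. This is
only a sketch, not the Airflow code: HeartbeatError, heartbeat_raising,
heartbeat_swallowing and run_job are hypothetical names made up for the
example, and a simulated clock stands in for real time. It assumes the job
refreshes its "last heartbeat" timestamp whenever the heartbeat call returns
without raising.

```python
class HeartbeatError(Exception):
    """Hypothetical stand-in for an error raised while heartbeating."""


def heartbeat_raising():
    # Models a heartbeat whose failure propagates to the caller.
    raise HeartbeatError("database unreachable")


def heartbeat_swallowing():
    # Models the behaviour described in this issue: the error is caught
    # and logged inside the heartbeat, so the caller never sees it.
    try:
        raise HeartbeatError("database unreachable")
    except HeartbeatError as exc:
        print("heartbeat got an exception: %s" % exc)


def run_job(heartbeat, threshold, ticks, tick_seconds=1.0):
    """Return True if the job shuts itself down within `ticks` iterations.

    A simulated clock is used so the example runs instantly.
    """
    now = 0.0
    last_heartbeat_time = now
    for _ in range(ticks):
        now += tick_seconds
        try:
            heartbeat()
            # Reached only if no exception escaped the heartbeat call.
            last_heartbeat_time = now
        except HeartbeatError:
            pass  # keep the stale timestamp
        if now - last_heartbeat_time > threshold:
            return True  # the job would kill its task process here
    return False


print(run_job(heartbeat_raising, threshold=3, ticks=10))     # True
print(run_job(heartbeat_swallowing, threshold=3, ticks=10))  # False
```

With the raising heartbeat the timestamp goes stale and the job stops itself
once the threshold is exceeded. With the swallowing heartbeat the timestamp is
refreshed on every iteration even though every heartbeat failed, so the
shutdown check never fires.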