[ https://issues.apache.org/jira/browse/AIRFLOW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ash Berlin-Taylor resolved AIRFLOW-5102.
----------------------------------------
Fix Version/s: 1.10.6
Resolution: Fixed
> Workers fail to shutdown jobs after failed heartbeats
> -----------------------------------------------------
>
> Key: AIRFLOW-5102
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5102
> Project: Apache Airflow
> Issue Type: Bug
> Components: worker
> Affects Versions: 1.10.3
> Reporter: Shantanu
> Assignee: Ash Berlin-Taylor
> Priority: Major
> Fix For: 1.10.6
>
>
> If a LocalTaskJob fails to heartbeat for longer than
> scheduler_zombie_task_threshold, it should shut itself down:
> [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/local_task_job.py#L109]
>
> However, at some point a change was made to catch exceptions inside the
> heartbeat itself:
> [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/base_job.py#L194]
> As a result, LocalTaskJob now believes every heartbeat succeeds.
>
> This effectively means that zombie tasks never shut themselves down. When the
> scheduler reschedules the task, two instances of it could end up running
> concurrently.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)