[ 
https://issues.apache.org/jira/browse/AIRFLOW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947006#comment-16947006
 ] 

ASF GitHub Bot commented on AIRFLOW-5102:
-----------------------------------------

ashb commented on pull request #6284: [AIRFLOW-5102] Worker jobs should 
terminate themselves if they can't heartbeat
URL: https://github.com/apache/airflow/pull/6284
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Workers fail to shutdown jobs after failed heartbeats
> -----------------------------------------------------
>
>                 Key: AIRFLOW-5102
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5102
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.10.3
>            Reporter: Shantanu
>            Assignee: Ash Berlin-Taylor
>            Priority: Major
>
> If a LocalTaskJob fails to heartbeat for scheduler_zombie_task_threshold
> seconds, it should shut itself down:
> [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/local_task_job.py#L109]
>  
> However, at some point a change was made to catch exceptions inside the
> heartbeat:
> [https://github.com/apache/airflow/blob/f34e13a/airflow/jobs/base_job.py#L194]
> As a result, LocalTaskJob now assumes that every heartbeat succeeds.
>  
> This effectively means that zombie tasks never shut themselves down. If the
> scheduler then reschedules the job, two instances of the task could be
> running concurrently.
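
The failure mode described above can be illustrated with a minimal Python
sketch. This is not Airflow's code: SketchJob, run_caller_timestamp and
run_job_timestamp are hypothetical names, time is simulated with integer
ticks, and ConnectionError stands in for whatever database error the real
heartbeat catches. The first loop keeps its own "last success" time and is
fooled by the swallowed exception; the second relies on a timestamp that is
only updated when the heartbeat is actually recorded, which is one possible
way to detect the outage.

```python
class SketchJob:
    """Hypothetical stand-in for a heartbeating job (not Airflow's class)."""

    def __init__(self, db_is_up):
        self.db_is_up = db_is_up
        self.latest_heartbeat = 0  # tick of the last heartbeat actually recorded

    def heartbeat(self, now):
        """Try to record a heartbeat, swallowing failures as the issue describes."""
        try:
            if not self.db_is_up:
                raise ConnectionError("cannot reach database")
            self.latest_heartbeat = now
        except ConnectionError:
            pass  # swallowed: the caller never learns that the heartbeat failed


def run_caller_timestamp(job, ticks, threshold):
    """Caller tracks its own success time; returns the tick at which it gives up."""
    last_ok = 0
    for now in ticks:
        job.heartbeat(now)
        last_ok = now  # always reached, because heartbeat() never raises
        if now - last_ok > threshold:
            return now
    return None  # never noticed a problem


def run_job_timestamp(job, ticks, threshold):
    """Caller trusts only the timestamp the job updates on a real success."""
    for now in ticks:
        job.heartbeat(now)
        if now - job.latest_heartbeat > threshold:
            return now  # heartbeats have been failing for too long
    return None


ticks = range(10, 101, 10)  # simulated clock: 10, 20, ..., 100
never_stops = run_caller_timestamp(SketchJob(db_is_up=False), ticks, threshold=30)
stops_at = run_job_timestamp(SketchJob(db_is_up=False), ticks, threshold=30)
```

With the simulated database down, the first loop runs to the end without
terminating, while the second stops once the threshold has been exceeded.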



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
