Tamas Flamich created AIRFLOW-4881:
--------------------------------------

             Summary: Zombie collection fails task instances that should be 
scheduled for retry
                 Key: AIRFLOW-4881
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4881
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.10.3
            Reporter: Tamas Flamich


In case a task instance
 * has more attempts than retries (it can happen when the state of the task 
instance is explicitly cleared) and
 * task instance is prematurely terminated (without graceful shutdown)

then zombie collection process of the scheduler can mark the task instance 
failed instead of retrying it. 


Steps to reproduce:

1 - The task is scheduled for a particular executed date and the following 
records gets created in the database.
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|1|2|running|

2 - The job owners would like to schedule the task again therefore they clear 
that state of the task instance. {{try_number}} and {{max_retries}} gets 
updated.
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|2|3|running|

3 - The Airlflow scheduler gets killed and a new scheduler instance starts 
looking for zombie tasks. Since {{try_number < max_tries}}, the new state is 
{{up_for_retry}}. However, there is a bug in the [state update 
logic|https://github.com/apache/airflow/blob/d5a5b9d9f1f1efb67ffed4d8e6ef3e0a06467bed/airflow/models/dagbag.py#L295]
 that will revert the {{max_tries}} value to the initial value ({{retries}}).
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|2|2|up_for_retry|

4 - During the next iteration of the scheduler, the task instance gets picked 
up. However, since {{try_number >= max_tries}}, the new state is {{failed}}.
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|2|2|failed|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to