Tamas Flamich created AIRFLOW-4881:
--------------------------------------
Summary: Zombie collection fails task instances that should be
scheduled for retry
Key: AIRFLOW-4881
URL: https://issues.apache.org/jira/browse/AIRFLOW-4881
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Affects Versions: 1.10.3
Reporter: Tamas Flamich
If a task instance
* has accumulated more attempts than its configured {{retries}} (this can happen when the state of the task instance is explicitly cleared) and
* is terminated prematurely (without a graceful shutdown),
then the scheduler's zombie collection process can mark the task instance as
{{failed}} instead of scheduling it for retry.
Steps to reproduce:
1 - The task is scheduled for a particular execution date and the following
record gets created in the database.
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|1|2|running|
2 - The job owners want to run the task again, so they clear the state of the
task instance. {{try_number}} and {{max_tries}} get updated.
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|2|3|running|
3 - The Airflow scheduler gets killed and a new scheduler instance starts
looking for zombie tasks. Since {{try_number < max_tries}}, the new state is
{{up_for_retry}}. However, there is a bug in the [state update
logic|https://github.com/apache/airflow/blob/d5a5b9d9f1f1efb67ffed4d8e6ef3e0a06467bed/airflow/models/dagbag.py#L295]
that reverts the {{max_tries}} value to its initial value ({{retries}}).
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|2|2|up_for_retry|
4 - During the next iteration of the scheduler, the task instance gets picked
up. Since {{try_number >= max_tries}} now holds, the new state is {{failed}}.
||task_id||retries||try_number||max_tries||state||
|emr_sensor|2|2|2|failed|
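The faulty transition can be sketched in plain Python. This is a simplified model, not the actual scheduler code: the class and function names are hypothetical, and only the fields from the tables above are modeled. The real logic lives in {{airflow/models/dagbag.py}}.

```python
class TaskInstance:
    """Simplified model holding only the fields shown in the tables above."""
    def __init__(self, retries, try_number, max_tries, state):
        self.retries = retries
        self.try_number = try_number
        self.max_tries = max_tries
        self.state = state

def handle_zombie_buggy(ti):
    """Zombie handling with the reported bug: max_tries is reset to retries."""
    if ti.try_number < ti.max_tries:
        ti.state = "up_for_retry"
        ti.max_tries = ti.retries  # bug: discards the extra attempt granted by clearing
    else:
        ti.state = "failed"

def handle_zombie_fixed(ti):
    """Expected behavior: leave max_tries untouched so the retry budget survives."""
    if ti.try_number < ti.max_tries:
        ti.state = "up_for_retry"
    else:
        ti.state = "failed"

def schedule_next(ti):
    """Next scheduler iteration: retry only if attempts remain."""
    if ti.state == "up_for_retry":
        ti.state = "failed" if ti.try_number >= ti.max_tries else "scheduled"

# Steps 2-4: a cleared task instance with try_number=2, max_tries=3.
ti = TaskInstance(retries=2, try_number=2, max_tries=3, state="running")
handle_zombie_buggy(ti)   # step 3: max_tries reverts from 3 to 2
schedule_next(ti)         # step 4: try_number >= max_tries, so the task fails
print(ti.state)           # failed
```

With {{handle_zombie_fixed}} instead, {{max_tries}} stays at 3, so the next iteration sees {{try_number < max_tries}} and reschedules the task for retry.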
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)