Kimberly Orr created AIRFLOW-6776:
-------------------------------------
Summary: Celery Executor Creates Zombie Tasks on Worker Death
Key: AIRFLOW-6776
URL: https://issues.apache.org/jira/browse/AIRFLOW-6776
Project: Apache Airflow
Issue Type: Bug
Components: celery, scheduler, worker
Affects Versions: 1.10.9, 1.10.4
Environment: Slightly modified fork of airflow 1.10.4.1 with
dockerized CeleryExecutor (ElastiCache Redis) on AWS EC2 instances.
Reporter: Kimberly Orr
Attachments: Zombie_Metrics.png
When a Celery worker dies unexpectedly, the state of its tasks is never updated
with the scheduler. If a worker dies while any tasks are running, those tasks
time out and get rescheduled on another available worker. However, the original
entries are never removed from the executor's running state, so their state
never changes. Depending on how parallelism is configured, these stale entries
can prevent anything else from being scheduled. Restarting the Airflow scheduler
resets the running state, but that is a workaround, not a long-term solution.
We believe this issue persists in the latest version of Airflow
(CeleryExecutor.sync() is still written to react silently to unexpected states).
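To illustrate the failure mode described above, here is a minimal toy model of
an executor that only frees a slot when it sees a terminal state. It is a
sketch under assumptions, not Airflow code: ToyExecutor, its methods, and
simulate_worker_death are hypothetical names, and the state strings are
placeholders rather than the real Celery constants.

```python
# Toy model only. ToyExecutor and simulate_worker_death are hypothetical
# names used for illustration; this is not Airflow's CeleryExecutor.
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


class ToyExecutor:
    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.running = {}  # task key -> last state seen

    @property
    def open_slots(self):
        # Slots are derived from the running set, so a stale entry
        # permanently consumes one slot.
        return self.parallelism - len(self.running)

    def queue(self, key):
        # Refuse new work when every slot is occupied.
        if self.open_slots <= 0:
            return False
        self.running[key] = "STARTED"
        return True

    def sync(self, reported_states):
        # Only terminal states remove an entry. Anything else, including
        # "no report at all", is recorded and otherwise ignored.
        for key in list(self.running):
            state = reported_states.get(key, self.running[key])
            if state in TERMINAL_STATES:
                del self.running[key]
            else:
                self.running[key] = state


def simulate_worker_death(parallelism=9):
    executor = ToyExecutor(parallelism)
    for i in range(parallelism):
        executor.queue(f"task_{i}")
    # The worker dies: no terminal state is ever reported for its tasks.
    executor.sync({})
    # Returns the remaining slots and whether a new task could be queued.
    return executor.open_slots, executor.queue("new_task")
```

In this model, once every slot is held by a task whose worker died, nothing
new can be queued until the running set is rebuilt, which corresponds to the
scheduler restart mentioned above.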
Steps to reproduce:
Start a long-running DAG, then kill the container running the Airflow worker
while tasks are running.
The attached screenshot shows metrics gathered while a zombie was being created.
We started a DAG with many long-running tasks and killed the worker between
15:25 and 15:30 (one worker, parallelism=9).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)