Kimberly Orr created AIRFLOW-6776:
-------------------------------------

             Summary: Celery Executor Creates Zombie Tasks on Worker Death
                 Key: AIRFLOW-6776
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6776
             Project: Apache Airflow
          Issue Type: Bug
          Components: celery, scheduler, worker
    Affects Versions: 1.10.9, 1.10.4
         Environment: Slightly modified fork of airflow 1.10.4.1 with dockerized CeleryExecutor (ElastiCache Redis) on AWS EC2 instances.
            Reporter: Kimberly Orr
         Attachments: Zombie_Metrics.png

When a celery worker dies unexpectedly, the state of its tasks is never updated with the scheduler. If a worker dies while any tasks are running, those tasks time out and get rescheduled on another available worker. However, the tasks are never removed from the executor's running state, so their state never changes (possibly preventing anything else from being scheduled, depending on the parallelism setting).

Restarting the airflow scheduler resets the running state, but that is not a long-term solution.

We believe this issue persists on the latest version of airflow (CeleryExecutor.sync() is still written to react silently to unexpected states).

Steps to reproduce:
Start a long-running dag and kill the container running the airflow worker while tasks are running.

The attached screenshot shows metrics gathered during the creation of a zombie. We started a dag with many long-running tasks and killed the worker between 15:25 and 15:30 (one worker, parallelism=9).
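
To illustrate the failure mode described above, here is a minimal toy model in Python. It is not Airflow's CeleryExecutor code: ToyExecutor, sync(), evict_unknown and simulate() are hypothetical names, and representing a dead worker's task as "no state reported" is an assumption made for the sketch. It only shows how a sync loop that silently skips unexpected states can leave every slot occupied.

```python
# Toy model only: this is NOT Airflow's CeleryExecutor implementation.
# All names here are hypothetical and exist purely for illustration.

TERMINAL_STATES = {"SUCCESS", "FAILURE"}


class ToyExecutor:
    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.running = set()  # task keys the executor believes are running

    def open_slots(self):
        # Slots the scheduler could still fill.
        return self.parallelism - len(self.running)

    def sync(self, reported_states, evict_unknown=False):
        for key in list(self.running):
            # Assumption: None means the (dead) worker never reported a state.
            state = reported_states.get(key)
            if state in TERMINAL_STATES:
                self.running.discard(key)
            elif state is None and evict_unknown:
                # Hypothetical mitigation: stop tracking tasks with no state.
                self.running.discard(key)
            # Any other case is skipped silently, so the key stays in running.


def simulate(evict_unknown):
    # One worker, parallelism=9, all nine tasks in flight when the worker dies.
    executor = ToyExecutor(parallelism=9)
    executor.running = {f"task_{i}" for i in range(9)}
    executor.sync(reported_states={}, evict_unknown=evict_unknown)
    return executor.open_slots()


if __name__ == "__main__":
    print("open slots, silent sync:  ", simulate(evict_unknown=False))
    print("open slots, evicting sync:", simulate(evict_unknown=True))
```

In this sketch the silent variant leaves no open slots after the worker dies, which matches the observation that nothing else gets scheduled until the scheduler is restarted. The evicting variant is only one possible direction for a fix, not a tested patch.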