Kimberly Orr created AIRFLOW-6776:
-------------------------------------

             Summary: Celery Executor Creates Zombie Tasks on Worker Death
                 Key: AIRFLOW-6776
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6776
             Project: Apache Airflow
          Issue Type: Bug
          Components: celery, scheduler, worker
    Affects Versions: 1.10.9, 1.10.4
         Environment: Slightly modified fork of airflow 1.10.4.1 with 
dockerized CeleryExecutor (ElastiCache Redis) on AWS EC2 instances.
            Reporter: Kimberly Orr
         Attachments: Zombie_Metrics.png

When a Celery worker dies unexpectedly, the state of its tasks is never updated 
with the scheduler. If a worker dies while any tasks are running, those tasks 
time out and are rescheduled on another available worker. However, the tasks 
are never removed from the executor's running state, so their state never 
changes. Depending on the parallelism configuration, this can prevent anything 
else from being scheduled. Restarting the Airflow scheduler resets the running 
state, but that is not a long-term solution.
 
We believe this issue persists in the latest version of Airflow 
(CeleryExecutor.sync() is still written to ignore unexpected states silently).
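
The mechanism can be illustrated with a toy model. This is only a sketch under 
stated assumptions, not Airflow's actual code: ToyExecutor, its fields, and 
simulate_worker_death() are hypothetical names, and the timeout and 
rescheduling on another worker are not modelled. It only shows how a sync loop 
that acts on terminal states alone leaves entries in the running state forever 
once the worker stops reporting.

```python
# Toy model only: ToyExecutor and its fields are hypothetical names,
# not Airflow's CeleryExecutor implementation.

SUCCESS, FAILURE, STARTED = "SUCCESS", "FAILURE", "STARTED"


class ToyExecutor:
    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.running = set()   # task keys the executor believes are running
        self.finished = {}     # task key -> terminal state reported back

    def open_slots(self):
        return self.parallelism - len(self.running)

    def queue(self, key):
        if self.open_slots() <= 0:
            return False       # no capacity: nothing new can be scheduled
        self.running.add(key)
        return True

    def sync(self, reported_states):
        """Poll the last reported state of every running task."""
        for key in list(self.running):
            state = reported_states.get(key)
            if state in (SUCCESS, FAILURE):
                self.running.discard(key)
                self.finished[key] = state
            # Any other state (including one frozen by a dead worker)
            # is ignored silently, so the key stays in `running`.


def simulate_worker_death(parallelism=9):
    executor = ToyExecutor(parallelism)
    keys = ["task_%d" % i for i in range(parallelism)]
    for key in keys:
        executor.queue(key)
    # The worker dies mid-run: the last reported state stays STARTED forever.
    reported_states = {key: STARTED for key in keys}
    for _ in range(100):
        executor.sync(reported_states)
    return executor
```

In this model, after simulate_worker_death(9) the executor has no open slots 
and queue() refuses any new task, which mirrors the one worker, parallelism=9 
scenario in this report.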
 
Steps to reproduce:
Start a long-running DAG, then kill the container running the Airflow worker 
while tasks are running.
 
The attached screenshot shows metrics gathered while a zombie was created. 
We started a DAG with many long-running tasks and killed the worker between 
15:25 and 15:30 (one worker, parallelism=9).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
