Joseph Harris created AIRFLOW-1941:
--------------------------------------

             Summary: Scheduler / executor loses tasks on restart when 
enforcing parallelism limit
                 Key: AIRFLOW-1941
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1941
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.8.1, 1.9.0
         Environment: Linux
            Reporter: Joseph Harris


When running the scheduler with a limited number of cycles, e.g.:
{{airflow scheduler -n 30}}
and with {{PARALLELISM=32}} set in airflow.cfg:

The Executor checks that {{len(self.running) < PARALLELISM}} before calling 
{{execute_async()}}:
https://github.com/apache/incubator-airflow/blob/master/airflow/executors/base_executor.py#L98
When {{self.running}} stays full for an extended period of time, the scheduler 
can exit without having launched the remaining tasks in {{self.queued_tasks}}. 
When it restarts, the lost tasks in {{self.queued_tasks}} are not scheduled 
again, and they remain stuck in the 'queued' state until manually kicked.
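To make the failure mode concrete, here is a minimal sketch (not Airflow's actual code; {{SketchExecutor}} and its methods are hypothetical stand-ins) of the heartbeat logic described above: open slots are computed from the parallelism limit before {{execute_async()}} is called, and anything still sitting in {{queued_tasks}} when the process exits is simply dropped.

```python
class SketchExecutor:
    """Hypothetical stand-in for BaseExecutor's queue/heartbeat behaviour."""

    def __init__(self, parallelism=32):
        self.parallelism = parallelism
        self.queued_tasks = {}   # task_key -> command, waiting for a slot
        self.running = set()     # task_keys already handed to execute_async()

    def queue_command(self, key, command):
        self.queued_tasks[key] = command

    def heartbeat(self):
        # Mirrors the len(self.running) < PARALLELISM check before
        # execute_async(): only as many tasks as there are open slots
        # are launched this cycle.
        open_slots = self.parallelism - len(self.running)
        for key in sorted(self.queued_tasks)[:open_slots]:
            command = self.queued_tasks.pop(key)
            self.running.add(key)
            self.execute_async(key, command)
        # Tasks still in self.queued_tasks after this loop wait for the
        # next heartbeat -- if the scheduler exits first, they are lost.

    def execute_async(self, key, command):
        pass  # hand-off to the worker backend


executor = SketchExecutor(parallelism=2)
for i in range(5):
    executor.queue_command("task_%d" % i, ["airflow", "run"])
executor.heartbeat()
# Two tasks are now running and three are still queued; a scheduler
# exit at this point orphans the queued three.
```

If {{self.running}} never drains (e.g. because of zombie jobs), every heartbeat computes zero open slots and the queued tasks never move.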

We hit this when exiting tasks with clashing PIDs caused the CeleryExecutor's 
{{self.running}} set to fill up with zombie jobs that could never complete.


* The Executor should not hold 'queued' tasks for an extended period of time, 
as the process may exit for any reason. The parallelism constraint should be 
checked alongside the other task dependencies.
* When shutting down 'gracefully', the scheduler should at least log a warning 
if there are any tasks left in {{self.queued_tasks}}.
* Parallelism should be set to infinity if a queue-based/distributed executor 
is being used (riskier).
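The second suggestion above could be as small as a shutdown hook along these lines (a sketch only; {{warn_on_stranded_tasks}} is a hypothetical name, not an existing Airflow function):

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scheduler")


def warn_on_stranded_tasks(queued_tasks):
    """Hypothetical graceful-shutdown hook: warn about tasks that were
    queued but never handed to execute_async(), and return their keys."""
    stranded = sorted(queued_tasks)
    if stranded:
        log.warning(
            "Scheduler exiting with %d task(s) still in queued_tasks: %s",
            len(stranded), stranded,
        )
    return stranded


# e.g. called at the end of the scheduler's run-N-cycles loop:
warn_on_stranded_tasks({"dag_a.task_1": None, "dag_b.task_2": None})
```

Even just this warning would turn a silent loss into something operators can alert on.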

This may be a common cause of tasks getting stuck in the 'queued' state when 
running Celery. Although AIRFLOW-900 is resolved in 1.9.0, this issue is still 
present, and the scheduler is still at risk of exiting without having scheduled 
tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
