Joseph Harris created AIRFLOW-1941:
--------------------------------------
Summary: Scheduler / executor loses tasks on restart when
enforcing parallelism limit
Key: AIRFLOW-1941
URL: https://issues.apache.org/jira/browse/AIRFLOW-1941
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Affects Versions: 1.8.1, 1.9.0
Environment: Linux
Reporter: Joseph Harris
This occurs when running the scheduler for a limited number of cycles, e.g.
{{airflow scheduler -n 30}}, with {{PARALLELISM=32}} set in airflow.cfg.
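For reference, the cap comes from the {{parallelism}} setting under {{[core]}} in airflow.cfg (the lowercase config key backs the {{PARALLELISM}} constant read by the executor):
{code}
[core]
# Maximum number of task instances the executor will run concurrently
parallelism = 32
{code}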
The Executor checks that {{len(self.running) < PARALLELISM}} before calling
{{execute_async()}}
https://github.com/apache/incubator-airflow/blob/master/airflow/executors/base_executor.py#L98
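In simplified form (paraphrased from the linked {{base_executor.py}}, not verbatim):
{code:python}
# Sketch of BaseExecutor.heartbeat() as of 1.8/1.9 (abridged): only tasks
# that fit under the parallelism cap are handed to execute_async(); the
# rest remain behind in the in-memory self.queued_tasks dict.
def heartbeat(self):
    open_slots = self.parallelism - len(self.running)

    # Pop up to open_slots entries out of queued_tasks and launch them.
    # Anything left over survives only in this process's memory.
    for _ in range(min(open_slots, len(self.queued_tasks))):
        key, (command, priority, queue, ti) = self.queued_tasks.popitem()
        self.running[key] = command
        self.execute_async(key, command=command, queue=queue)

    self.sync()
{code}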
When {{self.running}} stays full for an extended period, the scheduler can
exit without ever having dispatched the remaining tasks in {{self.queued_tasks}}.
Because {{self.queued_tasks}} is held only in the executor's memory, those tasks
are lost on restart: they don't get scheduled again, and sit in the 'queued'
state until manually kicked.
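For reference, we had to reset such tasks by hand; one way is to clear the
affected task instances so the scheduler picks them up again (the DAG id, task
regex, and dates below are placeholders):
{code}
airflow clear my_dag -t '^my_stuck_task$' -s 2017-12-10 -e 2017-12-10
{code}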
We ran into this when exiting tasks with clashing PIDs caused the
CeleryExecutor's {{self.running}} to fill up with zombie jobs that could never
complete.
* The executor should not hold 'queued' tasks for an extended period of time,
since it may exit for any reason. The parallelism constraint should be checked
alongside the other task dependencies.
* When shutting down 'gracefully', the scheduler should at least log a warning
if there are any tasks left in {{self.queued_tasks}} (see the sketch after this
list).
* Parallelism could be set to infinity when a queue-based/distributed executor
is being used (more risky).
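A minimal sketch of the second suggestion, assuming the warning is emitted from
the executor's {{end()}} hook, which the scheduler invokes on shutdown (the
hook choice and logger attribute are assumptions, not current Airflow code):
{code:python}
# Hypothetical change, not the current implementation: warn if the executor
# is shut down while it still holds tasks that were never dispatched.
def end(self):
    if self.queued_tasks:
        self.logger.warning(
            "Executor shutting down with %d task(s) still queued; "
            "they will not be rescheduled automatically: %s",
            len(self.queued_tasks), list(self.queued_tasks.keys())
        )
    self.sync()
{code}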
This may be a common cause of tasks getting stuck in the 'queued' state when
running Celery.
Although AIRFLOW-900 is resolved in 1.9.0, this issue is still present, and the
scheduler is still at risk of exiting without having scheduled all queued tasks.