[ https://issues.apache.org/jira/browse/AIRFLOW-203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergei Iakhnin updated AIRFLOW-203:
-----------------------------------
Attachment: airflow.cfg
airflow_scheduler_working.log
airflow_scheduler_non_working.log
> Scheduler fails to reliably schedule tasks when many dag runs are triggered
> ---------------------------------------------------------------------------
>
> Key: AIRFLOW-203
> URL: https://issues.apache.org/jira/browse/AIRFLOW-203
> Project: Apache Airflow
> Issue Type: Bug
> Components: scheduler
> Affects Versions: Airflow 1.7.1.2
> Reporter: Sergei Iakhnin
> Attachments: airflow.cfg, airflow_scheduler_non_working.log,
> airflow_scheduler_working.log
>
>
> Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master
> node and 115 worker nodes, each with 8 cores. The workflow consists of a series
> of 27 tasks, some of which are nearly instantaneous and some of which take
> hours to complete. DAG runs are manually triggered, about 3000 at a time,
> resulting in roughly 75 000 tasks.
> My observations are that the scheduling behaviour is extremely inconsistent,
> i.e. about 1000 tasks get scheduled and executed and then no new tasks get
> scheduled after that. Sometimes it is enough to restart the scheduler for new
> tasks to get scheduled, sometimes the scheduler and worker services need to
> be restarted multiple times to get any progress. When I look at the scheduler
> output it seems to be chugging away at trying to schedule tasks with messages
> like:
> "[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue:
> airflow run ..."
> However, these tasks do not show up in queued status on the UI and don't
> actually get scheduled out to the workers (nor do they make it into the
> RabbitMQ queue or the task_instance table).
> It is unclear what may be causing this behaviour as no errors are produced
> anywhere. The impact is especially high when short-running tasks are
> concerned because the cluster should be able to blow through them within a
> couple of minutes, but instead it takes hours of manual restarts to get
> through them.
> I'm happy to share logs or any other useful debug output as desired.
> Thanks in advance.
> Sergei.
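One way to narrow this down might be to count how many "Adding to queue" messages the scheduler logs per minute and compare that number with what actually reaches the RabbitMQ queue or the task_instance table. The following is only a minimal sketch, assuming the log line layout quoted in the description (with a leading "[" on the timestamp). The function name count_queue_messages and the sample lines are hypothetical and are not part of Airflow.

```python
import re
from collections import Counter

# Assumed layout, based on the message quoted in the report:
# [2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: airflow run ...
QUEUE_LINE = re.compile(
    r"^\[(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}\] "
    r"\{base_executor\.py:\d+\} INFO - Adding to queue:"
)


def count_queue_messages(lines):
    """Map 'YYYY-MM-DD HH:MM' to the number of 'Adding to queue'
    messages found for that minute in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = QUEUE_LINE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts


# Illustrative input only: the first line is the message from the report,
# the other two are made up to exercise the matching.
sample = [
    "[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: airflow run ...",
    "[2016-06-01 11:28:26,010] {base_executor.py:34} INFO - Adding to queue: airflow run ...",
    "some unrelated line",
]
counts = count_queue_messages(sample)
for minute, n in sorted(counts.items()):
    print(minute, n)
```

Since the function only iterates over lines, an open file object for the scheduler log could be passed in place of the sample list. A per-minute count that stays high while nothing arrives in the queue would support the observation that the scheduler believes it is queueing tasks that never leave it.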
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)