[ 
https://issues.apache.org/jira/browse/AIRFLOW-203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergei Iakhnin updated AIRFLOW-203:
-----------------------------------
    Attachment: airflow.cfg
                airflow_scheduler_working.log
                airflow_scheduler_non_working.log

> Scheduler fails to reliably schedule tasks when many dag runs are triggered
> ---------------------------------------------------------------------------
>
>                 Key: AIRFLOW-203
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-203
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: Airflow 1.7.1.2
>            Reporter: Sergei Iakhnin
>         Attachments: airflow.cfg, airflow_scheduler_non_working.log, 
> airflow_scheduler_working.log
>
>
> Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master 
> node and 115 worker nodes, each with 8 cores. The workflow consists of a series 
> of 27 tasks, some of which are nearly instantaneous and some of which take 
> hours to complete. Dag runs are triggered manually, about 3000 at a time, 
> resulting in roughly 75,000 tasks.
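> For reference, the runs are triggered through the standard CLI in a loop along 
> the lines of the sketch below (the dag id "my_pipeline" and the run-id naming 
> are placeholders, not the exact script):
>
>     #!/usr/bin/env python
>     # Rough sketch of the trigger loop: fires one dag run per input id via
>     # the Airflow 1.7 CLI. "my_pipeline" and the id range are illustrative.
>     import subprocess
>
>     for i in range(3000):
>         run_id = "manual_run_%04d" % i
>         subprocess.check_call(
>             ["airflow", "trigger_dag", "my_pipeline", "-r", run_id])
>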
> My observation is that scheduling behaviour is extremely inconsistent: roughly 
> 1000 tasks get scheduled and executed, and then no new tasks are scheduled 
> after that. Sometimes restarting the scheduler is enough for new tasks to be 
> scheduled; sometimes the scheduler and worker services need to be restarted 
> multiple times to get any progress. When I look at the scheduler output it 
> appears to be chugging away at trying to schedule tasks, with messages like:
> "2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: 
> airflow run ..."
> However, these tasks never show up as queued in the UI and are never actually 
> dispatched to the workers (nor do they make it into the RabbitMQ queue or the 
> task_instance table).
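> One way to confirm this is to look at task state counts directly in the 
> Postgres metadata DB; a minimal sketch of such a check (the connection 
> parameters are placeholders for my setup):
>
>     import psycopg2
>
>     # Count task instances per state to see whether anything actually
>     # reaches the queued/running states. Connection string is a placeholder.
>     conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")
>     cur = conn.cursor()
>     cur.execute("SELECT state, count(*) FROM task_instance GROUP BY state")
>     for state, count in cur.fetchall():
>         print(state, count)
>     cur.close()
>     conn.close()
>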
> It is unclear what may be causing this behaviour, as no errors are produced 
> anywhere. The impact is especially high for the short-running tasks: the 
> cluster should be able to blow through them within a couple of minutes, but 
> instead it takes hours of manual restarts to get through them.
> I'm happy to share logs or any other useful debug output as desired.
> Thanks in advance.
> Sergei.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
