[ 
https://issues.apache.org/jira/browse/AIRFLOW-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982469#comment-16982469
 ] 

Shlomi Cohen commented on AIRFLOW-203:
--------------------------------------

Hi

I know this post is old, but 3 years later we face the same problem with 
version 1.10.5.
We have 1 DAG and want to launch hundreds of DAG runs for it (manually, with 
different configurations).

Even taking the example DAG and launching 100 runs for it gets the scheduler 
stuck, and it then needs to be restarted.
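
For anyone trying to reproduce this, here is a rough sketch of how the 100 
runs could be generated. It only builds the command lines for the Airflow 
1.10 CLI (airflow trigger_dag with -r for the run id and -c for the JSON 
conf). The DAG name example_bash_operator, the run id scheme and the 
run_index conf key are assumptions for illustration, not the exact setup 
described here.

```python
import json


def build_trigger_commands(dag_id, count):
    """Return one `airflow trigger_dag` argv list per run, each with its own conf."""
    commands = []
    for i in range(count):
        # Hypothetical per-run configuration; replace with the real payload.
        conf = json.dumps({"run_index": i})
        # Hypothetical run id scheme; run ids must be unique per DAG.
        run_id = "manual_repro_{:04d}".format(i)
        commands.append(
            ["airflow", "trigger_dag", "-r", run_id, "-c", conf, dag_id]
        )
    return commands


# Assumed DAG name: one of the example DAGs that ship with Airflow.
commands = build_trigger_commands("example_bash_operator", 100)
```

Passing each list to subprocess.check_call on a machine where the airflow 
CLI is installed would submit the runs one by one.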

Any help would be appreciated, because I have played with every possible 
configuration Airflow has to offer and still hit this problem.

Thanks,

Shlomi

> Scheduler fails to reliably schedule tasks when many dag runs are triggered
> ---------------------------------------------------------------------------
>
>                 Key: AIRFLOW-203
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-203
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.7.1.2
>            Reporter: Sergei Iakhnin
>            Priority: Major
>         Attachments: airflow.cfg, airflow_scheduler_non_working.log, 
> airflow_scheduler_working.log
>
>
> Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master 
> node and 115 worker nodes, each with 8 cores. The workflow consists of a 
> series of 27 tasks, some of which are nearly instantaneous and some of which 
> take hours to complete. DAG runs are manually triggered, about 3000 at a time, 
> resulting in roughly 75 000 tasks.
> My observations are that the scheduling behaviour is extremely inconsistent, 
> i.e. about 1000 tasks get scheduled and executed and then no new tasks get 
> scheduled after that. Sometimes it is enough to restart the scheduler for new 
> tasks to get scheduled, sometimes the scheduler and worker services need to 
> be restarted multiple times to get any progress. When I look at the scheduler 
> output it seems to be chugging away at trying to schedule tasks with messages 
> like:
> "[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: 
> airflow run ..."
> However, these tasks do not show up in queued status in the UI and don't 
> actually get scheduled out to the workers (nor do they make it into the 
> RabbitMQ queue or the task_instance table).
> It is unclear what may be causing this behaviour as no errors are produced 
> anywhere. The impact is especially high when short-running tasks are 
> concerned because the cluster should be able to blow through them within a 
> couple of minutes, but instead it takes hours of manual restarts to get 
> through them.
> I'm happy to share logs or any other useful debug output as desired.
> Thanks in advance.
> Sergei.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
