[ https://issues.apache.org/jira/browse/AIRFLOW-203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergei Iakhnin updated AIRFLOW-203:
-----------------------------------
    Description: 
Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master
node and 115 worker nodes, each with 8 cores. The workflow consists of a series
of 27 tasks, some of which are nearly instantaneous and some of which take
hours to complete. Dag runs are manually triggered, about 3000 at a time,
resulting in roughly 75,000 tasks.
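
For reference, the runs are triggered in a batch from the CLI, roughly along
these lines (the dag_id and run_id format below are placeholders for
illustration, not the real ones):

{code}
# Rough illustration of how a batch of ~3000 dag runs is triggered.
# "my_pipeline" and the run_id format are placeholders, not the real names.
import subprocess

DAG_ID = "my_pipeline"

for i in range(3000):
    subprocess.check_call([
        "airflow", "trigger_dag", DAG_ID,
        "--run_id", "batch_{:05d}".format(i),
    ])
{code}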

I have observed that the scheduling behaviour is extremely inconsistent: about
1000 tasks get scheduled and executed, and then no new tasks are scheduled
after that. Sometimes it is enough to restart the scheduler for new tasks to be
scheduled; at other times the scheduler and worker services need to be
restarted multiple times before any progress is made. When I look at the
scheduler output, it appears to be busily queuing tasks, with messages like:

"[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue:
airflow run ..."

However, these tasks do not show up in queued status in the UI and are never
actually dispatched to the workers (nor do they make it into the RabbitMQ
queue).
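
As far as I can tell, the "Adding to queue" message only means the task was
placed in the executor's in-memory queue; handing it off to Celery/RabbitMQ
happens later, during the executor heartbeat, and is capped by the parallelism
setting. A simplified sketch of that general pattern (not the actual Airflow
source; names are approximate):

{code}
# Simplified illustration (not the real Airflow code) of the queue-then-dispatch
# pattern behind the "Adding to queue" message. Queuing is an in-memory step;
# nothing reaches RabbitMQ until the executor heartbeat dispatches it, and the
# heartbeat only dispatches up to the configured parallelism.
class ToyExecutor(object):
    def __init__(self, parallelism=32):
        self.parallelism = parallelism
        self.queued_tasks = {}   # filled by the scheduler
        self.running = set()     # tasks already handed to the broker

    def queue_command(self, key, command):
        print("Adding to queue: %s" % command)  # the message seen in the log
        self.queued_tasks[key] = command        # nothing sent to RabbitMQ yet

    def heartbeat(self):
        open_slots = max(self.parallelism - len(self.running), 0)
        for key in list(self.queued_tasks)[:open_slots]:
            command = self.queued_tasks.pop(key)
            self.running.add(key)
            self.send_to_broker(command)        # only now does Celery see it

    def send_to_broker(self, command):
        pass  # stand-in for the Celery send_task call
{code}

If the scheduler keeps logging "Adding to queue" while nothing ever reaches the
broker, that would point at this dispatch step (or the bookkeeping around it)
rather than at the queuing itself.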

It is unclear what is causing this behaviour, as no errors are produced
anywhere. The impact is especially high for short-running tasks: the cluster
should be able to get through them within a couple of minutes, but instead it
takes hours of manual restarts to complete them.

I'm happy to share logs or any other useful debug output as desired.
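
For example, the state breakdown in the metadata database while tasks appear
stuck can be pulled with something like this (the connection string is a
placeholder):

{code}
# Rough debugging sketch: count task instance states in the Airflow metadata DB.
# The DSN below is a placeholder; task_instance and its "state" column are part
# of the standard Airflow schema.
import psycopg2

conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")
cur = conn.cursor()
cur.execute(
    "SELECT state, COUNT(*) FROM task_instance GROUP BY state ORDER BY 2 DESC"
)
for state, count in cur.fetchall():
    print("%-10s %d" % (state or "(none)", count))
cur.close()
conn.close()
{code}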

Thanks in advance.

Sergei.


  was:
Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master
node and 115 worker nodes, each with 8 cores. The workflow consists of a series
of 27 tasks, some of which are nearly instantaneous and some of which take
hours to complete. Dag runs are manually triggered, about 3000 at a time,
resulting in roughly 75,000 tasks.

I have observed that the scheduling behaviour is extremely inconsistent: about
1000 tasks get scheduled and executed, and then no new tasks are scheduled
after that. Sometimes it is enough to restart the scheduler for new tasks to be
scheduled; at other times the scheduler and worker services need to be
restarted multiple times before any progress is made. When I look at the
scheduler output, it appears to be busily queuing tasks, with messages like:

"[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue:
airflow run ..."

However, these tasks do not show up in queued status in the UI and are never
actually dispatched to the workers.

It is unclear what is causing this behaviour, as no errors are produced
anywhere. The impact is especially high for short-running tasks: the cluster
should be able to get through them within a couple of minutes, but instead it
takes hours of manual restarts to complete them.

I'm happy to share logs or any other useful debug output as desired.

Thanks in advance.

Sergei.



> Scheduler fails to reliably schedule tasks when many dag runs are triggered
> ---------------------------------------------------------------------------
>
>                 Key: AIRFLOW-203
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-203
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: Airflow 1.7.1
>            Reporter: Sergei Iakhnin
>
> Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master
> node and 115 worker nodes, each with 8 cores. The workflow consists of a
> series of 27 tasks, some of which are nearly instantaneous and some of which
> take hours to complete. Dag runs are manually triggered, about 3000 at a
> time, resulting in roughly 75,000 tasks.
> I have observed that the scheduling behaviour is extremely inconsistent:
> about 1000 tasks get scheduled and executed, and then no new tasks are
> scheduled after that. Sometimes it is enough to restart the scheduler for new
> tasks to be scheduled; at other times the scheduler and worker services need
> to be restarted multiple times before any progress is made. When I look at
> the scheduler output, it appears to be busily queuing tasks, with messages
> like:
> "[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue:
> airflow run ..."
> However, these tasks do not show up in queued status in the UI and are never
> actually dispatched to the workers (nor do they make it into the RabbitMQ
> queue).
> It is unclear what is causing this behaviour, as no errors are produced
> anywhere. The impact is especially high for short-running tasks: the cluster
> should be able to get through them within a couple of minutes, but instead it
> takes hours of manual restarts to complete them.
> I'm happy to share logs or any other useful debug output as desired.
> Thanks in advance.
> Sergei.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
