Nadeem,

Unfortunately this slowness is currently a known deficiency in the
scheduler. It will be addressed in the future, but obviously we are not
there yet. To make it more manageable you could set an end_date on the
DAG and create multiple DAGs for it, keeping the logic the same but
giving each a different dag_id and start_date/end_date. If you are on
1.7.1.3 you will then benefit from multiprocessing (max_threads for the
scheduler). It does mean you spread the load by hand. Not ideal, but it
will work.
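
A rough sketch of what I mean; the yearly split, the emr_pipeline
dag_id prefix and the run_pipeline callable are placeholders I made up,
so keep your own operators and vary only the ids and date windows:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def run_pipeline():
    # your existing s3 copy / mapreduce / hive and impala load logic
    pass


# split the 2014-05-26 .. 2016-07-13 backfill into yearly chunks
ranges = [
    ("2014", datetime(2014, 5, 26), datetime(2014, 12, 31)),
    ("2015", datetime(2015, 1, 1), datetime(2015, 12, 31)),
    ("2016", datetime(2016, 1, 1), datetime(2016, 7, 13)),
]

for suffix, start, end in ranges:
    dag = DAG(
        dag_id="emr_pipeline_%s" % suffix,
        start_date=start,
        end_date=end,
        schedule_interval="@daily",
    )
    PythonOperator(
        task_id="run_pipeline",
        python_callable=run_pipeline,
        dag=dag,
    )
    # expose each dag at module level so the scheduler picks it up
    globals()[dag.dag_id] = dag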

Also, depending on how quickly your tasks finish, you could limit the
heartbeat so the scheduler does not run redundantly while it is not able
to fire off new tasks.
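
For instance, in airflow.cfg (the 60 seconds here is just an example;
tune it to how long your tasks typically take):

[scheduler]
# default is 5; a larger interval keeps the scheduler from spending
# most of its loop on redundant state checks
scheduler_heartbeat_sec = 60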

In addition, why are you using num_runs? I definitely do not recommend
using it with a LocalExecutor, and if you are on 1.7.1.3 I would not use
it with Celery either.
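
In other words, just run the scheduler as a long-lived process instead
of restarting it every few loops:

# instead of: airflow scheduler -n 5
airflow scheduler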

I hope this helps!

Bolke

> On 13 Jul 2016, at 10:43, Nadeem Ahmed Nazeer <[email protected]> wrote:
> 
> Hi,
> 
> We are using airflow to establish a data pipeline that runs tasks on an
> ephemeral Amazon EMR cluster. The oldest data we have is from 2014-05-26,
> which we have set as the start date, with a schedule interval of 1 day
> for airflow.
> 
> We have an S3 copy task, a MapReduce task, and a bunch of Hive and Impala
> load tasks in our DAG, all run via PythonOperator. Our expectation is for
> airflow to run each of these tasks for each day from the start date until
> the current date.
> 
> Just for numbers, the number of dag runs that got created was
> approximately 800 from the start date until the current date (2016-07-13).
> All is well at the start of the execution, but as it executes more and
> more tasks, the scheduling of tasks starts slowing down. It looks like
> the scheduler is spending a lot of time checking states and doing other
> housekeeping tasks.
> 
> One scheduler loop is taking almost 240 to 300 seconds due to the huge
> number of tasks. It has been running my dags for over 24 hours now with
> little progress. I am starting the scheduler process so that it restarts
> every 5 runs, which is the default (airflow scheduler -n 5).
> 
> I did play around with different parallelism and config parameters
> without much help. I am looking for some assistance on making the
> scheduler schedule the tasks quickly and effectively. Please help.
> 
> Configs:
> parallelism = 32
> dag_concurrency = 16
> max_active_runs_per_dag = 99999
> celeryd_concurrency = 16
> scheduler_heartbeat_sec = 5
> 
> Thanks,
> Nadeem
