Hi, We are using airflow to establish a data pipeline that runs tasks on ephemeral amazon emr cluster. The oldest data we have is from 2014-05-26 which we have set as the start date with a scheduler interval of 1 day for airflow.
We have an s3 copy task, a map reduce task and a bunch of hive and impala load tasks in our DAG all run via PythonOperator. Our expectation is for airflow to run each of these tasks for each day from the start date till current date. Just for numbers, the number of dags that got created were approximately 800 from start date till current date (2016-07-13). All is well at the start of the execution but as it executes more and more tasks, the scheduling of tasks starts slowing down. Looks like the scheduler is spending lot of time in checking states and other houskeeping tasks. One scheduler loop is taking almost 240 to 300 seconds due to the huge number of tasks. It has been running my dags for over 24 hours now with little progress. I am starting the scheduler process with restart for every 5 runs which is the default (airflow scheduler -n 5). I did play around with different parallelism and config parameters without much help. I am looking for some assistance on making scheduler quickly and effectively schedule the tasks. Please help. Configs : parallelism = 32 dag_concurrency = 16 max_active_runs_per_dag = 99999 celeryd_concurrency = 16 scheduler_heartbeat_sec = 5 Thanks, Nadeem
