and https://github.com/apache/incubator-airflow/pull/1735
On Mon, Aug 15, 2016 at 7:33 PM, siddharth anand <[email protected]> wrote:

Nice idea....

https://issues.apache.org/jira/browse/AIRFLOW-431

On Mon, Aug 8, 2016 at 4:54 AM, Jeremiah Lowin <[email protected]> wrote:

Sure, just modify this code:

import airflow
from airflow.models import Pool

session = airflow.settings.Session()

pool = (
    session.query(Pool)
    .filter(Pool.pool == 'my_pool')
    .first())

if not pool:
    session.add(
        Pool(
            pool='my_pool',
            slots=8,
            description='this is my pool'
        )
    )
    session.commit()

On Sun, Aug 7, 2016 at 4:37 PM Nadeem Ahmed Nazeer <[email protected]> wrote:

Could we create a pool programmatically instead of manually creating it from the UI? I want to create this pool from the chef script when airflow starts up.

Thanks,
Nadeem

On Wed, Jul 13, 2016 at 5:21 PM, Lance Norskog <[email protected]> wrote:

Nazeer - "If I don't use num_runs, the scheduler would just stop after running some number of tasks and I can't figure out why."
This is a known bug.

One way to help this scheduling is to create a Pool. A Pool is a queue gatekeeper that allows at most N tasks to run concurrently. If you set the Pool size to, say, 5-10 and make all tasks join that pool, then only that many tasks will run. The point of Pools is to regulate access to contested resources. In your case, all of your external services (S3, Hadoop) are contested resources: you may have 30 S3 jobs running at once and 50 M/R jobs trying to run. You will find this all runs more smoothly when you control the number of active tasks with a pool.

Another technique is that either a DAG or a task (I can't remember which) can wait until previous days finish. This is another way to regulate the flow of tasks.

After all, you would not do this in the shell:

for x in 500 hive scripts
do
    hive -f $x &
done

This is exactly what Airflow is doing with out-of-control tasks.

Lance

--
Lance Norskog
[email protected]
Redwood City, CA
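To make the pool advice concrete, here is a minimal sketch (not from the thread) of how tasks might join such a pool and wait for the previous day's run to finish. It assumes the 'my_pool' pool from Jeremiah's snippet already exists; the DAG id, task id and callable are made-up placeholders, while pool and depends_on_past are standard task arguments.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def copy_from_s3():
    # placeholder for the real S3 copy logic
    pass


dag = DAG(
    'emr_pipeline',                       # made-up dag_id for the example
    start_date=datetime(2014, 5, 26),
    schedule_interval='@daily',
)

s3_copy = PythonOperator(
    task_id='s3_copy',
    python_callable=copy_from_s3,
    pool='my_pool',                       # at most `slots` pooled tasks run at once
    depends_on_past=True,                 # each day waits for the previous day's task to succeed
    dag=dag,
)

With depends_on_past=True only one day of that task is in flight at a time, and the pool caps how many pooled tasks hit S3/EMR concurrently, which is the regulation Lance describes above.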
On Wed, Jul 13, 2016 at 11:18 AM, Nadeem Ahmed Nazeer <[email protected]> wrote:

Thanks for the response, Bolke. Looking forward to having this scheduler slowness fixed in future airflow releases. I am currently on version 1.7.0, will upgrade to 1.7.1.3 and also try your suggestions.

I am using CeleryExecutor. If I don't use num_runs, the scheduler would just stop after running some number of tasks and I can't figure out why. The scheduler would only start running again after I restart the service manually. The fix for that was to add this parameter. I found the num_runs parameter used by default in the upstart script for the scheduler and also read in the manual to use it (https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls).

Thanks,
Nadeem

On Wed, Jul 13, 2016 at 8:51 AM, Bolke de Bruin <[email protected]> wrote:

Nadeem,

Unfortunately this slowness is currently a deficiency in the scheduler. It will be addressed in the future, but obviously we are not there yet. To make it more manageable you could use end_date for the dag and create multiple dags for it, keeping the logic the same but the dag_id and the start_date / end_date different. If you are on 1.7.1.3 you will then benefit from multiprocessing (max_threads for the scheduler). In addition, you are then splitting the load by hand. Not ideal, but it will work.

Also, depending on how quickly your tasks finish, you could limit the heartbeat so the scheduler does not run redundantly while it is unable to fire off new tasks.

In addition, why are you using num_runs? I definitely do not recommend using it with a LocalExecutor, and if you are on 1.7.1.3 I would not use it with Celery either.

I hope this helps!

Bolke

On 13 Jul 2016, at 10:43, Nadeem Ahmed Nazeer <[email protected]> wrote the following:

Hi,

We are using airflow to establish a data pipeline that runs tasks on ephemeral Amazon EMR clusters. The oldest data we have is from 2014-05-26, which we have set as the start date with a schedule interval of 1 day for airflow.

We have an S3 copy task, a map reduce task and a bunch of Hive and Impala load tasks in our DAG, all run via PythonOperator. Our expectation is for airflow to run each of these tasks for each day from the start date till the current date.

Just for numbers, the number of dag runs that got created was approximately 800 from the start date till the current date (2016-07-13). All is well at the start of the execution, but as it executes more and more tasks, the scheduling of tasks starts slowing down. It looks like the scheduler is spending a lot of time checking states and doing other housekeeping tasks.

One scheduler loop takes almost 240 to 300 seconds due to the huge number of tasks. It has been running my dags for over 24 hours now with little progress. I am starting the scheduler process with a restart every 5 runs, which is the default (airflow scheduler -n 5).

I did play around with different parallelism and config parameters without much help. I am looking for some assistance on making the scheduler schedule the tasks quickly and effectively. Please help.

Configs :
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 99999
celeryd_concurrency = 16
scheduler_heartbeat_sec = 5

Thanks,
Nadeem
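As a rough illustration of Bolke's suggestion above, splitting the backfill into several DAGs that share the same logic but differ in dag_id and start_date/end_date, a sketch follows; the year boundaries and the add_pipeline_tasks helper are invented for the example.

from datetime import datetime

from airflow import DAG


def build_slice(dag_id, start_date, end_date):
    # same pipeline logic for every slice, only the id and date range differ
    dag = DAG(
        dag_id,
        start_date=start_date,
        end_date=end_date,            # this slice stops scheduling at end_date
        schedule_interval='@daily',
    )
    # add_pipeline_tasks(dag)         # hypothetical helper holding the shared S3/MR/Hive/Impala tasks
    return dag


# Module-level variables so the scheduler picks each slice up; with
# max_threads > 1 on 1.7.1.3 the slices can be processed in parallel.
backfill_2014 = build_slice('pipeline_2014', datetime(2014, 5, 26), datetime(2014, 12, 31))
backfill_2015 = build_slice('pipeline_2015', datetime(2015, 1, 1), datetime(2015, 12, 31))
backfill_2016 = build_slice('pipeline_2016', datetime(2016, 1, 1), datetime(2016, 7, 13))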
