Hmm.. Thanks Lance. I mentioned the 'backfill' pool because I saw it used as part of 'default_args' in an airflow example.
Chris/Dan/Bolke/Jeremiah/Paul/all :) : So suppose I create two pools, 'pool1' and 'pool2', and use them for tasks t1 and t2. Now say I also create a pool called 'backfill' but do not use it in any of the tasks inside my DAG. Whenever I run the backfill for my DAG with `--pool backfill`, will the scheduler use the slots from this backfill pool, or will the tasks use pool1 and pool2?

On Mon, Jun 20, 2016 at 9:20 PM, Dan Davydov <[email protected]> wrote:

> At the moment by default backfill does not use a pool, but you can specify
> one with --pool.
>
> On Mon, Jun 20, 2016 at 9:02 PM, Chris Riccomini <[email protected]> wrote:
>
> > Hey Harish,
> >
> > One thing that I'm not clear on is whether backfill even honors pools at
> > all. I believe backfill currently starts its own scheduler outside of the
> > main scheduler process. As a result, I think the pools are completely
> > disregarded. Bolke/Jeremiah/Paul can correct me if I'm wrong.
> >
> > Cheers,
> > Chris
> >
> > On Mon, Jun 20, 2016 at 7:46 PM, Lance Norskog <[email protected]> wrote:
> >
> > > One reason to use Pools is because you have tasks in different DAGs that
> > > all use the same resource, like a database. A Pool lets you say, "I will
> > > send no more than 3 requests to this database at once". However, there
> > > are bugs in the scheduler and it is possible to have many active tasks
> > > overscheduled against a pool.
> > >
> > > You can create a pool in the Admin->Pools drop-down. You don't need a
> > > script.
> > >
> > > On Mon, Jun 20, 2016 at 2:46 PM, harish singh <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > We have been using airflow for 3 months now.
> > > >
> > > > One pain point I felt was during backfill: if I have two tasks t1 and
> > > > t2, with t1 having depends_on_past=True,
> > > >
> > > > t0 -> t1
> > > > t0 -> t2
> > > >
> > > > I find that the task t2, which has no past dependency, keeps getting
> > > > scheduled. This causes the task t1 to wait for a long time before it
> > > > gets scheduled.
> > > >
> > > > I think this is a good use case for creating "pools" and allocating
> > > > slots for each pool. I will also have to use priority_weight and
> > > > adjust parallelism.
> > > >
> > > > Is there a better way to handle this?
> > > >
> > > > Also, in general, are there any examples of how to use pools?
> > > >
> > > > I peeked into airflow/tests/operators/subdag_operator.py and found
> > > > the snippet below:
> > > >
> > > > session = airflow.settings.Session()
> > > > pool_1 = airflow.models.Pool(pool='test_pool_1', slots=1)
> > > > session.add(pool_1)
> > > > session.commit()
> > > >
> > > > Why do we need a Session instance? Do we need to run the code below
> > > > before creating a pool in code (inside my pipeline.py under the dags/
> > > > directory)?
> > > >
> > > > pool = (
> > > >     session.query(Pool)
> > > >     .filter(Pool.pool == 'AIRFLOW-205')
> > > >     .first())
> > > > if not pool:
> > > >     session.add(Pool(pool='AIRFLOW-205', slots=8))
> > > >     session.commit()
> > > >
> > > > Also, I saw a few places where pool: 'backfill' is used.
> > > >
> > > > Is 'backfill' a special pre-defined pool?
> > > >
> > > > If not, how do we create different pools based on whether it is a
> > > > backfill or not?
> > > >
> > > > All this is being done in a pipeline.py script under the 'dags/'
> > > > directory.
> > > >
> > > > Thanks,
> > > > Harish
> > >
> > > --
> > > Lance Norskog
> > > [email protected]
> > > Redwood City, CA
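As a side note on the query-then-add snippet quoted above: the point of that pattern is to make pool creation idempotent, so a DAG file that gets parsed over and over neither errors out nor duplicates rows. A minimal stand-in sketch of just that logic (a plain dict replaces Airflow's metadata DB here; in real code the lookup is `session.query(Pool).filter(...).first()` via `airflow.settings.Session()`, as in the snippet):

```python
# Sketch of the idempotent get-or-create pool pattern discussed above.
# A dict stands in for the Pool table in Airflow's metadata database;
# the real version commits through a SQLAlchemy Session as quoted.

pool_table = {}  # hypothetical stand-in for the metadata DB

def ensure_pool(name, slots):
    """Create the pool only if it does not already exist.

    An existing pool is left untouched (never resized), matching the
    'if not pool: session.add(...)' guard in the quoted snippet.
    """
    if name not in pool_table:      # session.query(Pool)...first() is None
        pool_table[name] = slots    # session.add(Pool(...)); session.commit()
    return pool_table[name]

print(ensure_pool('AIRFLOW-205', 8))   # first call creates the pool -> 8
print(ensure_pool('AIRFLOW-205', 4))   # already exists, stays at 8 -> 8
```

Because the second call is a no-op, the DAG file stays safe to re-parse on every scheduler loop.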
