Hi,
We have been using Airflow for about 3 months now.
One pain point I have hit is during backfills: if I have two tasks t1 and t2,
with t1 having depends_on_past=True, arranged as
t0 -> t1
t0 -> t2
I find that t2, which has no past dependency, keeps getting scheduled run
after run. This leaves t1 waiting a long time before it gets scheduled.
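To make the setup concrete, here is roughly what the DAG looks like (a
trimmed-down sketch; the dag id, dates, and DummyOperator stand-ins are made
up):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('backfill_example', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

t0 = DummyOperator(task_id='t0', dag=dag)
# t1 must wait for its own previous run before it can start
t1 = DummyOperator(task_id='t1', depends_on_past=True, dag=dag)
# t2 has no past dependency, so the backfill keeps picking it
t2 = DummyOperator(task_id='t2', dag=dag)

t0.set_downstream(t1)
t0.set_downstream(t2)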
I think this is a good use case for creating "pools" and allocating slots to
each pool. It also looks like I would have to set priority_weight on the
tasks and adjust parallelism as well.
Is there a better way to handle this?
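Just so it's clear what I'm picturing, something like this (a rough sketch;
the pool name 'etl_pool' and the weight values are made up, and the pool
itself would have to be created separately):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('pool_example', start_date=datetime(2016, 1, 1))

# both tasks share one small pool so t2 cannot monopolize the slots;
# priority_weight nudges the scheduler toward t1 when a slot frees up
t1 = DummyOperator(task_id='t1', depends_on_past=True,
                   pool='etl_pool', priority_weight=10, dag=dag)
t2 = DummyOperator(task_id='t2',
                   pool='etl_pool', priority_weight=1, dag=dag)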
Also, in general, are there any examples of how to use pools?
I peeked into airflow/tests/operators/subdag_operator.py and found the
snippet below:
import airflow.models
import airflow.settings

# open a session against the Airflow metadata database
session = airflow.settings.Session()
pool_1 = airflow.models.Pool(pool='test_pool_1', slots=1)
session.add(pool_1)
session.commit()
Why do we need a Session instance? And do we need to run the code below
before creating a pool in code (inside my pipeline.py under the dags/
directory):
from airflow.models import Pool
import airflow.settings

session = airflow.settings.Session()
# look the pool up first so we don't insert a duplicate row
pool = (
    session.query(Pool)
    .filter(Pool.pool == 'AIRFLOW-205')
    .first())
if not pool:
    session.add(Pool(pool='AIRFLOW-205', slots=8))
    session.commit()
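(My intent with the query-first check is to avoid inserting a duplicate pool
row every time the DAG file gets re-parsed.)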
Also, I saw a few places where pool='backfill' is used. Is 'backfill' a
special pre-defined pool? If not, how do we create different pools depending
on whether the run is a backfill or not?
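For what it's worth, this is how I'm attaching a task to a pool right now (a
sketch; it assumes a pool named 'backfill' has already been created, e.g.
with the Session code above or through the UI):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('uses_backfill_pool', start_date=datetime(2016, 1, 1))

t1 = BashOperator(task_id='t1',
                  bash_command='echo hello',
                  pool='backfill',  # must match an existing pool name
                  dag=dag)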
All of this is being done in a pipeline.py script under the 'dags/' directory.
Thanks,
Harish