Yati created AIRFLOW-2001:
-----------------------------
Summary: Make sensors relinquish their execution slots
Key: AIRFLOW-2001
URL: https://issues.apache.org/jira/browse/AIRFLOW-2001
Project: Apache Airflow
Issue Type: Bug
Components: db, scheduler
Reporter: Yati
Assignee: Yati
A sensor task instance should not take up an execution slot for the entirety of
its lifetime, as is currently the case. Indeed, for reasons outlined below, it
would be better if sensor execution were preempted by the scheduler, parking
the task away from its slot until the next poll.
Some sensors wait for a condition that is brought about only by an external
party (e.g., materialization, by external means, of certain rows in a table).
By external, I mean external to the Airflow installation in question, so that
the producing entity does not itself need an execution slot in an Airflow
pool. If all sensors and their dependencies were of this nature, there would
be no issue. Unfortunately, many real-world DAGs have sensor dependencies on
results produced by another task, typically in some other DAG, but scheduled
by the same Airflow scheduler.
Consider a simple example (arrow direction means "must happen before", just
like in Airflow): DAG1(a >> b) and DAG2(c:sensor(DAG1.b) >> d). In other
words, the opening task c of the second DAG has a sensor dependency on the
final task b of the first DAG. Imagine we have a single pool with 10 execution
slots, and task instances of c somehow fill up the pool while the
corresponding task instances of DAG1.b have not yet had a chance to execute
(in the real world this happens because of, say, back-fills, or reprocessing
by clearing those sensor instances and their upstream tasks). This is a
deadlock: no progress can be made, since the sensors have filled up the pool
waiting on tasks that will themselves never get a chance to run. The problem
has been [acknowledged
here|https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls].
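A minimal sketch of that setup (all dag/task ids, the pool name, and the
1.10-style import paths are illustrative only; the pool "shared_pool" is
assumed to already exist with 10 slots):
{code:python}
# Sketch of the deadlock scenario above. Names and the pool are hypothetical;
# import paths may differ across Airflow versions.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

default_args = {'owner': 'airflow', 'start_date': datetime(2018, 1, 1)}

# DAG1: a >> b
dag1 = DAG('DAG1', default_args=default_args, schedule_interval='@daily')
a = DummyOperator(task_id='a', pool='shared_pool', dag=dag1)
b = DummyOperator(task_id='b', pool='shared_pool', dag=dag1)
a >> b

# DAG2: c (sensor on DAG1.b) >> d
dag2 = DAG('DAG2', default_args=default_args, schedule_interval='@daily')
c = ExternalTaskSensor(
    task_id='c',
    external_dag_id='DAG1',
    external_task_id='b',
    pool='shared_pool',      # same pool as DAG1's tasks
    poke_interval=60,
    dag=dag2,
)
d = DummyOperator(task_id='d', pool='shared_pool', dag=dag2)
c >> d
{code}
If enough instances of DAG2.c get queued first, all 10 slots end up holding
sensors poking for DAG1.b, and DAG1.b can never enter the pool.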
One way to solve this (suggested by Fokko) is to always run sensors in a
dedicated pool of their own, and to be careful with the concurrency settings
of sensor tasks (sketched below). This is what a lot of users do today, but we
can do better. Since all the sensor interface allows for is a poll, we can,
after each poll, "park" the sensor and yield its execution slot to other
tasks. In the scenario above, the pool would never be filled up by sensor
tasks: they would be polled, determined to be still unfulfilled, and then
parked away, giving other tasks a chance to run.
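For reference, the dedicated-pool workaround amounts to something like the
following (a sketch; "sensor_pool" is a hypothetical pool assumed to have been
created beforehand via the UI or the pool CLI, and sensible sizes depend on
the installation):
{code:python}
# Workaround sketch: give sensors a pool of their own so they can never
# exhaust the slots needed by the tasks they are waiting on.
c = ExternalTaskSensor(
    task_id='c',
    external_dag_id='DAG1',
    external_task_id='b',
    pool='sensor_pool',      # dedicated pool, separate from 'shared_pool'
    poke_interval=60,
    timeout=60 * 60 * 12,    # fail the sensor after 12 hours of waiting
    dag=dag2,
)
{code}
This removes the deadlock, but only contains the waste: every occupied slot in
the sensor pool still belongs to a process that spends nearly all of its time
sleeping.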
This would likely require some changes to the DB, and of course to the scheduler.
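For context on why parking between polls is feasible: the only hook a sensor
implements is poke(), and today's BaseSensorOperator.execute() simply loops
over it while holding the slot. Roughly (a simplified sketch, not the exact
source):
{code:python}
# Simplified sketch of the current BaseSensorOperator.execute() loop. The
# task instance keeps its pool slot while sleeping between pokes; that sleep
# is the part we would like to replace with "park and re-queue later".
import time
from datetime import datetime

from airflow.exceptions import AirflowSensorTimeout

def execute(self, context):
    started_at = datetime.utcnow()
    while not self.poke(context):      # poke() is the only sensor-specific hook
        if (datetime.utcnow() - started_at).total_seconds() > self.timeout:
            raise AirflowSensorTimeout('Snap. Time is OUT.')
        time.sleep(self.poke_interval)  # slot stays occupied the whole time
{code}
Parking would mean that, instead of sleeping in place, the task instance gives
back its slot and the scheduler re-queues it around the next poke_interval.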