Yati created AIRFLOW-2001:
-----------------------------

             Summary: Make sensors relinquish their execution slots
                 Key: AIRFLOW-2001
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2001
             Project: Apache Airflow
          Issue Type: Bug
          Components: db, scheduler
            Reporter: Yati
            Assignee: Yati


A sensor task instance should not take up an execution slot for the entirety of
its lifetime (as is currently the case). Indeed, for the reasons outlined below,
it would be better if the scheduler preempted a sensor between polls, parking
it away from its slot until the next poll.

Some sensors wait for a condition that only an external party can make true
(e.g., the materialization by external means of certain rows in a table). By
external, I mean external to the Airflow installation in question, such that
the producing entity itself does not need an execution slot in an Airflow
pool. If all sensors and their dependencies were of this nature, there would be
no issue. Unfortunately, a lot of real-world DAGs have sensor dependencies on
results produced by another task, typically in some other DAG, but scheduled by
the same Airflow scheduler.

Consider a simple example (arrow direction represents "must happen before",
just like in Airflow): DAG1(a >> b) and DAG2(c:sensor(DAG1.b) >> d). In other
words, the opening task c of the second DAG has a sensor dependency on the
ending task b of the first DAG. Imagine we have a single pool with 10 execution
slots, and somehow task instances of c fill up the pool while the
corresponding task instances of DAG1.b have not had a chance to execute (in the
real world this happens because of, say, backfills, or reprocessing triggered
by clearing those sensor instances and their upstream tasks). This is a
deadlock: no progress can be made, because the sensors have filled up the pool
waiting on tasks that themselves will never get a chance to run. This problem
has been [acknowledged
here|https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls].
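
To make the scenario concrete, here is a toy simulation of it. This is not
Airflow code: the function name, the (kind, index) tuples, and the idea of
discrete scheduling rounds are all assumptions made up for illustration. It
only models a single pool in which a sensor holds its slot until its upstream
task has finished.

```python
def simulate_blocking_sensors(slots: int = 10, rounds: int = 5) -> dict:
    """Toy model (hypothetical, not Airflow internals) of sensors that keep
    their slot for their whole lifetime."""
    # Worst case from the example: instances of DAG2.c grab every slot first.
    running = [("sensor", i) for i in range(slots)]
    # The matching DAG1.b instances are still waiting for a slot.
    waiting = [("task", i) for i in range(slots)]
    done = set()

    for _ in range(rounds):
        # A sensor gives up its slot only once its upstream task is done.
        running = [
            t for t in running
            if not (t[0] == "sensor" and ("task", t[1]) in done)
        ]
        # Freed slots go to waiting DAG1.b instances, which finish at once.
        free = slots - len(running)
        for t in waiting[:free]:
            done.add(t)
        waiting = waiting[free:]

    return {"b_completed": len(done), "sensors_holding_slots": len(running)}


result = simulate_blocking_sensors()
print(result)
```

However many rounds are run, no slot is ever freed in this model, so no
instance of DAG1.b completes and every sensor keeps holding its slot.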

One way (suggested by Fokko) to solve this is to always run sensors in their
own pool, and to be careful with the concurrency settings of sensor tasks. This
is what a lot of users do now, but there are better solutions. Since polling is
all the sensor interface allows, we can, after each poll, "park" the sensor and
yield its execution slot to other tasks. In the above scenario, there would be
no "filling up" of the pool by sensor tasks: each would be polled, found to be
still unfulfilled, and then parked away, thereby giving other tasks a chance to
run.
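
A rough sketch of the poll-then-park idea, again as a toy model rather than a
proposal for the actual implementation. The queue, the notion of a "tick", and
the function name are assumptions for illustration only; a real change would
involve the scheduler and task instance state in the DB.

```python
from collections import deque


def simulate_parked_sensors(slots: int = 10, n: int = 10, max_ticks: int = 100):
    """Toy model (hypothetical): a sensor is polled once per turn and, if
    unfulfilled, is parked at the back of the queue instead of keeping its slot."""
    done = set()
    # Worst case: all sensors are queued ahead of the tasks they wait on.
    queue = deque(
        [("sensor", i) for i in range(n)] + [("task", i) for i in range(n)]
    )
    ticks = 0
    while queue and ticks < max_ticks:
        ticks += 1
        # Hand out at most `slots` execution slots for this tick.
        batch = [queue.popleft() for _ in range(min(slots, len(queue)))]
        finished = []
        for kind, i in batch:
            if kind == "task":
                finished.append((kind, i))   # ordinary task runs to completion
            elif ("task", i) in done:
                finished.append((kind, i))   # poll succeeded, sensor is done
            else:
                queue.append((kind, i))      # poll failed: park, free the slot
        done.update(finished)
    return ticks, done


ticks, done = simulate_parked_sensors()
print(ticks, len(done))
```

With the defaults, the first tick polls and parks all ten sensors, the second
runs the ten upstream tasks, and the third lets the sensors succeed, so the
same starting state that deadlocked before now drains in three ticks.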

This would likely require some changes to the DB, and of course to the scheduler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
