If you have the ability to run code from the external system, you might want
to consider using the ("experimental") API to trigger the dag run directly:
http://airflow.apache.org/docs/stable/api.html#post--api-experimental-dags--DAG_ID--dag_runs
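For example, a minimal sketch using the requests library (the DAG id, host,
and payload here are placeholders; check the endpoint against the docs above
for your Airflow version, and add whatever auth your webserver requires):

    import requests

    # POST to the experimental API to create a new dag run for "my_dag".
    # Assumes the webserver is reachable at localhost:8080.
    resp = requests.post(
        "http://localhost:8080/api/experimental/dags/my_dag/dag_runs",
        json={"conf": {"triggered_by": "external-event"}},
    )
    resp.raise_for_status()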
If using the API doesn't work for you, the common approach I have seen is, as
you hint at, having a "trigger" dag that runs frequently (how frequently
depends on your needs), checks the external condition, and uses
TriggerDagRunOperator to kick off the real dag.
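Roughly something like this (a sketch against the 1.10.x operator, where
python_callable decides whether to trigger; check_condition is a placeholder
for your own external check):

    from airflow import DAG
    from airflow.operators.dagrun_operator import TriggerDagRunOperator
    from airflow.utils.dates import days_ago

    def maybe_trigger(context, dag_run_obj):
        # Returning the dag_run_obj triggers the target dag;
        # returning None means "condition not met, do nothing this run".
        if check_condition():  # placeholder for your external check
            return dag_run_obj
        return None

    with DAG("trigger_dag", schedule_interval="*/10 * * * *",
             start_date=days_ago(1), catchup=False) as dag:
        trigger = TriggerDagRunOperator(
            task_id="check_and_trigger",
            trigger_dag_id="target_dag",
            python_callable=maybe_trigger,
        )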
The other way I have seen this done is to just have the first task of your dag
be a sensor that checks/waits on the external resource. With the recently added
"reschedule" mode of sensors, this also doesn't tie up a worker slot while the
sensor isn't actively poking. This is the approach I have used in the past when
processing weekly datasets that could appear anywhere in a 72 hour window
after the expected delivery time.
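For instance (another rough sketch; ExternalResourceSensor and the check it
performs are placeholders you would write yourself):

    from airflow.sensors.base_sensor_operator import BaseSensorOperator

    class ExternalResourceSensor(BaseSensorOperator):
        def poke(self, context):
            # Return True once the external resource is available,
            # e.g. a file has landed or an API reports "ready".
            return external_resource_ready()  # placeholder check

    wait = ExternalResourceSensor(
        task_id="wait_for_upstream",
        mode="reschedule",       # frees the worker slot between pokes
        poke_interval=15 * 60,   # re-check every 15 minutes
        timeout=72 * 60 * 60,    # give up after the 72 hour window
        dag=dag,
    )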
Given these options exist, I'm not quite sure I see the need for a new parameter
to the DAG (especially one which runs user code in the scheduler; that gets
quite a strong no from me). Could you perhaps explain your idea in more detail,
specifically how it fits into your workflow, and why you don't want to use the
two methods I talked about here?
Thanks,
Ash
On Feb 14 2020, at 10:10 am, bharath palaksha <[email protected]> wrote:
> Hi,
>
> I have been using Airflow extensively in my current work at Walmart Labs.
> While working on our requirements, I came across functionality which is
> missing in Airflow and which, if implemented, would be very useful.
> Currently, Airflow is a schedule-based workflow management system: a cron
> expression defines the creation of dag runs. If there is a dependency on a
> different dag, TriggerDagRunOperator helps in creating dag runs.
> Suppose there is a dependency which is outside of the Airflow cluster, e.g.
> a different database, a filesystem, or an event from an API which is an
> upstream dependency. There is no way in Airflow to achieve this unless we
> schedule a DAG for a very short interval and allow it to poll.
>
> To solve the above issue, what if Airflow took 2 different args -
> schedule_interval
> and trigger_sensor:
>
> - schedule_interval - works the same way as it is already working now
> - trigger_sensor - accepts a sensor which returns true when an event is
> sensed, and this in turn creates a dag run (see the sketch below)
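>
> A rough illustration of what this could look like from the DAG author's
> side (trigger_sensor is the proposed argument and NewFileSensor is a
> made-up sensor; neither exists today):
>
>     dag = DAG(
>         dag_id="process_upstream_data",
>         schedule_interval=None,            # no cron schedule
>         trigger_sensor=NewFileSensor(...)  # proposed new argument
>     )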
>
> If you specify both arguments, schedule_interval takes precedence.
> The scheduler parses all DAGs in a loop every heartbeat and checks for DAGs
> which have reached their scheduled time, creating a dag run; the same loop
> could also check for trigger_sensor and, if the argument is set, check
> whether it returns true and create a dag run. This might slow down the
> scheduler as it would have to execute sensors; we can find some other way
> to avoid the slowness.
> Can we create AIP for this? Any thoughts?
>
> Thanks,
> Bharath
>