And the reason for including dag_id/task_id in that new table is to
avoid a new "top level" query; instead we can fold this work into the
existing "look at dagrun X" scheduling.
(Strictly speaking we could just have a table of "dataset X has just
been published", but for that to work we would need a new "top level"
section of the main scheduler loop, which adds extra cost even when
nothing is to be done.)
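To make that concrete, here is a rough, purely illustrative SQLAlchemy
sketch (the model and helper names are mine, not the actual
implementation; the columns mirror the pending_dataset_events table
described below). Because every queue row carries dag_id/run_id, the
existing per-DagRun code path can fetch its own pending events with a
keyed lookup, rather than the scheduler doing a new table-wide scan on
every loop:
```
from sqlalchemy import Column, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class PendingDatasetEvent(Base):
    """Hypothetical queue row: one "dataset published" event awaiting action."""

    __tablename__ = "pending_dataset_events"

    dag_id = Column(String(250), primary_key=True)
    task_id = Column(String(250), primary_key=True)
    run_id = Column(String(250), primary_key=True)
    dataset_uri = Column(String(3000), primary_key=True)


def pending_events_for_dagrun(session: Session, dag_id: str, run_id: str):
    # Because each row carries dag_id/run_id, the existing "look at dagrun X"
    # code path can fetch its own pending events with a keyed lookup, instead
    # of the scheduler scanning the whole table in a new top-level query.
    return (
        session.query(PendingDatasetEvent)
        .filter(
            PendingDatasetEvent.dag_id == dag_id,
            PendingDatasetEvent.run_id == run_id,
        )
        .all()
    )
```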
-ash
On Wed, Jun 8 2022 at 17:23:21 +0100, Ash Berlin-Taylor
<[email protected]> wrote:
Hi Ping,
Good idea.
Very roughly:
We will create and use a new database table to store a queue of
dataset publishes to be actioned, with columns (dag_id, task_id,
run_id, dataset_uri). Rows in that table are created by TaskInstance
in the same transaction where that TI is marked as success. (This is
important so that we don't "miss" any events.)
The DagRuns for dataset-aware scheduling will be created from within
the DagRun.update_state function of the _producing_ DagRun (which is
called from either the mini scheduler in LocalTaskJob or the main
scheduler loop): it will look at the pending rows, create any
necessary DagRuns there, and then delete the pending_dataset_events rows.
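As a very rough sketch of that consumption step (the helper names here
are placeholders I have made up, not real Airflow APIs), assuming the
pending queue rows for this dag_id/run_id have already been fetched:
```
from typing import Iterable

from sqlalchemy.orm import Session


def dags_scheduled_on(dataset_uri: str, *, session: Session) -> list:
    """Placeholder: look up dag_ids of DAGs declaring schedule_on this dataset."""
    raise NotImplementedError


def create_dataset_triggered_dagrun(dag_id: str, *, session: Session) -> None:
    """Placeholder: create the new DagRun for a consuming DAG."""
    raise NotImplementedError


def action_pending_dataset_events(session: Session, pending: Iterable) -> None:
    # Called from the producing DagRun's update_state (mini scheduler in
    # LocalTaskJob, or the main scheduler loop), with the queue rows that
    # belong to this dag_id/run_id.
    for event in pending:
        # Trigger every DAG that asked to be scheduled on this dataset...
        for consumer_dag_id in dags_scheduled_on(event.dataset_uri, session=session):
            create_dataset_triggered_dagrun(consumer_dag_id, session=session)
        # ...then remove the pending_dataset_events row so the event is
        # actioned exactly once, whichever code path got here first.
        session.delete(event)
```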
We need to store these in the DB as a queue since there are multiple
places where this code could run, and we don't want to lose events.
A simple "last_updated" column on the row in the "dataset" DB table is
not enough, as it's possible for two TIs from different DagRuns to
complete at almost the same instant, and we don't want to miss
creating a data-scheduled DagRun.
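A minimal sketch of the "same transaction" point (illustrative only;
the real change would live inside TaskInstance, and the raw SQL here
just stands in for the ORM insert): marking the TI success and
inserting its pending_dataset_events rows commit together, so two
producers finishing at almost the same instant each leave their own
rows and neither event can be lost.
```
from sqlalchemy import text
from sqlalchemy.orm import Session


def mark_success_and_queue_events(session: Session, ti, outlet_uris) -> None:
    # Illustrative only: ti is a TaskInstance-like object being marked success.
    ti.state = "success"
    session.merge(ti)
    for uri in outlet_uris:
        # One row per (TI, dataset): concurrent producers each insert their
        # own row, unlike a single last_updated column they would both race
        # to overwrite.
        session.execute(
            text(
                "INSERT INTO pending_dataset_events "
                "(dag_id, task_id, run_id, dataset_uri) "
                "VALUES (:dag_id, :task_id, :run_id, :uri)"
            ),
            {"dag_id": ti.dag_id, "task_id": ti.task_id, "run_id": ti.run_id, "uri": uri},
        )
    # The state change and the queue rows commit (or roll back) together, so
    # a dataset publish can never be recorded without the success, or vice versa.
    session.commit()
```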
The advantage of doing it this way is that there is next-to-no impact
on scheduling performance if you don't have a task that produces
datasets.
Does that give you enough of an idea of how we plan to implement
this?
Thanks,
Ash
On Mon, Jun 6 2022 at 19:11:40 -0700, Ping Zhang <[email protected]>
wrote:
Hi Ash,
Thanks for introducing the data-driven scheduling and thoughtful
future work section in the AIP.
Could you please add a section to the AIP discussing the high-level/
early technical design? This is a new scheduling pattern that can
have many implications, so we (our team) would love to know more
about how you will design and implement it.
Thanks,
Ping
On Wed, Jun 1, 2022 at 9:34 AM Ash Berlin-Taylor <[email protected]> wrote:
Hi All,
Now that Summit is over (well done all the speakers! The talks I've
caught so far have been great) I'm ready to push forward with Data
Driven Scheduling, and I would like to call for a vote on
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling>
The vote will last for 7 days, until 2022/06/07 at 16:30 UTC.
(This is my +1 vote)
I have just published updates to the AIP, hopefully to make the AIP
tighter in scope (and easier to implement too). The tl;dr of this
AIP:
- Add a concept of a Dataset (which is a URI-parseable str. Airflow
places no meaning on what the URI contains/means/is; the "airflow:"
scheme is reserved)
- A task "produces" a dataset by a) having it in its outlets
attribute, and b) finishing with SUCCESS. (That is, Airflow doesn't
know/care about data transfer/SQL tables etc. It is purely
conceptual.)
- A DAG says that it wants to be triggered when its dataset (or
any of its datasets) changes. When this happens the scheduler will
create the DagRun.
This is just a high level summary, please read the confluence page
for full details.
We have already thought about lots of ways we can (and will) extend
this over time, detailed in the "Future work" section. Our
goal with this AIP is to build the kernel of Data-aware Scheduling
that we can build on over time.
A teaser/example DAG that hopefully gives a clue as to what we are
talking about here:
```
import pandas as pd

from airflow import dag, Dataset

dataset = Dataset("s3://s3_default@some_bucket/order_data")


@dag
def my_dag():
    @dag.task(outlets=[dataset])
    def producer():
        # What this task actually does doesn't matter to Airflow: the simple
        # act of running to SUCCESS means the dataset is updated, and
        # downstream dags will get triggered
        ...

    producer()


# The consuming DAG (perhaps in a different file) refers to the same Dataset
dataset = Dataset("s3://s3_default@some_bucket/order_data")


@dag(schedule_on=dataset)
def consuming_dag():
    @dag.task
    def consumer(uri):
        df = pd.read_from_s3(uri)
        print(f"Dataset had {df.count()} rows")

    consumer(uri=dataset.uri)
```
If anyone has any changes you think are fundamental/foundational to
the core idea, you have 1 week to raise them :) (Names of parameters
we can easily change as we implement this.) Our desire is to get
this written and released in Airflow 2.4.
Thanks,
Ash