potiuk commented on issue #24778:
URL: https://github.com/apache/airflow/issues/24778#issuecomment-1172637873

   Interesting idea, but I think it's mixing "task graph" with "artifact graph" 
behaviour. And actually we already have AIP-48 in progress, which (IMHO) 
implements the concept you have in mind in a much better way - without 
changing the upstream/downstream behaviour of Airflow or the branching 
concept.
   
   The mechanism you describe is fine for a "build system", where you decide 
what "target" you want to achieve, but I believe Airflow DAGs describe the 
"process", not the "target".
   
   The whole premise of DAGs is to describe what processing tasks should 
happen, not what "data artifact" we want to achieve as a result of the DAG run. 
Each step in an Airflow DAG might result in an artifact dataset - even more 
than one - that might be used inside the DAG; but what makes it different from 
`make` is that the artifact might also be used outside of the DAG.
   
   The parallel to "build systems" is wrong, because nodes in a build 
system are the artifacts themselves. You specify a "binary", "library", or 
"source" as nodes in the graph, describe the relations between them, and say 
"I want to get this artifact; please find out which other artifacts are 
needed". An Airflow DAG is different: it does not describe artifacts, it 
describes tasks - i.e. actions that might produce the artifacts. In a build 
system you do not specify "I want to run a compilation task on Y to receive X"; 
you specify "I want to get X and it needs Y", and you let the system figure out 
what tasks need to run to get from Y to X.
   
   For me, the idea you have is great for describing "data dependencies" 
(build system), but not "task dependencies" (Airflow). And we are going to 
implement the use case you describe without changing Airflow's task-dependency 
model.
   
   This is maybe not as clear yet, but with 
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling
 being implemented, we will enable what you have in mind - but in a better way. 
We are not "reversing" the branching concept; we are adding a "dataset" concept 
to the existing DAG structure (which is good). This is a big part of bringing 
data lineage into the Airflow world, and I think that is really what you are 
after: you are not interested in running "taskA" or "taskB". You are really 
interested in getting dataset "D1" or "D2", and in a way to do that.
   
   By implementing data dependencies and scheduling, and adding OpenLineage on 
top, we are going to add an option for anyone to ask "I want to generate 
dataset X - which tasks should be run to get it?". I believe this is what you 
are really asking for here.
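   A hedged sketch of the kind of query this enables (all names here - 
`lineage`, `tasks_for_dataset` - are hypothetical, not an actual Airflow or 
OpenLineage API; the point is only that task-level lineage makes the 
"which tasks produce dataset X?" question answerable):

```python
# Hypothetical lineage index: task -> (input datasets, output datasets).
# This is NOT an Airflow API; it only illustrates how lineage data can
# answer "which tasks should run to generate dataset X?".
lineage = {
    "extract": ([], ["raw"]),
    "clean": (["raw"], ["clean_data"]),
    "aggregate": (["clean_data"], ["report"]),
}

def tasks_for_dataset(target, lineage):
    """Return the tasks (in run order) needed to produce `target`."""
    # dataset -> the task that produces it
    producer = {ds: task for task, (_, outs) in lineage.items() for ds in outs}
    order, seen = [], set()

    def visit(dataset):
        task = producer.get(dataset)
        if task is None or task in seen:
            return
        seen.add(task)
        for upstream in lineage[task][0]:
            visit(upstream)
        order.append(task)

    visit(target)
    return order

print(tasks_for_dataset("report", lineage))
# -> ['extract', 'clean', 'aggregate']
```

   Here the DAG still describes tasks, exactly as Airflow does today; the 
dataset-oriented question is answered on top of it, from lineage metadata.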
   
   There was an excellent talk closely related to this from Willy Lulciuc at 
the Airflow Summit: 
https://airflowsummit.org/sessions/2022/automating-airflow-backfills-with-marquez/
   
   I strongly recommend watching it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
