potiuk commented on issue #24778:
URL: https://github.com/apache/airflow/issues/24778#issuecomment-1172637873

   Interesting idea, but I think it's mixing "task graph" with "artifact graph" 
behaviour. And actually we already have AIP-48 in progress, which (IMHO) 
implements the concept you have in mind in a much better way - without 
changing the upstream/downstream behaviour of Airflow or the branching 
concept.
   
   The mechanism you describe is fine for a "build system", where you decide 
what "target" you want to achieve, but I believe Airflow DAGs describe the 
"process", not the "target".
   
   The whole premise of DAGs is to describe what processing tasks should 
happen, not what "data artifact" we want to achieve as a result of the DAG run. 
Each step in an Airflow DAG might result in an artifact dataset - even more 
than one - that might be used inside the DAG; but what makes it different from 
`make` is that the artifact might also be used outside of the DAG.
   
   The parallel to "build systems" is wrong, because nodes in a build 
system are the artifacts themselves. You specify a "binary", "library", or 
"source" as nodes in the graph, describe the relations between them, and say 
"I want to get this artifact; please find out which other artifacts are 
needed". An Airflow DAG is different: it does not describe artifacts, it 
describes tasks - i.e. actions that might produce the artifacts. In a build 
system you do not specify "I want to run a compilation task on Y to receive X"; 
you specify "I want to get X and it needs Y", and you let the system figure out 
what tasks need to run to get from Y to X.
   
   For me, the idea you have is great for describing "data dependencies" 
(build system), but not "task dependencies" (Airflow). And we are going to 
implement the use case you describe without changing Airflow's task-dependency 
model.
   
   This is maybe not as clear yet, but with 
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling
 being implemented, we will enable what you have in mind - but in a better way. 
We are not "reversing" the branching concept; we are adding a "dataset" concept 
to the existing DAG structure (which is good). This is a big part of bringing 
data lineage into the Airflow world, and I think that is really what you are 
after: you are not interested in running "taskA" or "taskB". You are really 
interested in getting dataset "D1" or "D2", and in a way to do that.
   
   By implementing data dependencies and scheduling, and adding OpenLineage on 
top, we are going to add an option for anyone to ask "I want to generate 
dataset X - which tasks should be run to get it?". I believe this is what you 
are really asking for here.
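   A hedged sketch of the kind of query this enables (all names here - 
`lineage`, `tasks_for_dataset` - are hypothetical, not an actual Airflow or 
OpenLineage API; the point is only that task-level lineage makes the 
"which tasks produce dataset X?" question answerable):

```python
# Hypothetical lineage index: task -> (input datasets, output datasets).
# This is NOT an Airflow API; it only illustrates how lineage data can
# answer "which tasks should run to generate dataset X?".
lineage = {
    "extract": ([], ["raw"]),
    "clean": (["raw"], ["clean_data"]),
    "aggregate": (["clean_data"], ["report"]),
}

def tasks_for_dataset(target, lineage):
    """Return the tasks (in run order) needed to produce `target`."""
    # dataset -> the task that produces it
    producer = {ds: task for task, (_, outs) in lineage.items() for ds in outs}
    order, seen = [], set()

    def visit(dataset):
        task = producer.get(dataset)
        if task is None or task in seen:
            return
        seen.add(task)
        for upstream in lineage[task][0]:
            visit(upstream)
        order.append(task)

    visit(target)
    return order

print(tasks_for_dataset("report", lineage))
# -> ['extract', 'clean', 'aggregate']
```

   Here the DAG still describes tasks, exactly as Airflow does today; the 
dataset-oriented question is answered on top of it, from lineage metadata.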
   
   There was an excellent talk closely related to this from Willy Lulciuc at 
the Airflow Summit: 
https://airflowsummit.org/sessions/2022/automating-airflow-backfills-with-marquez/
   
   I strongly recommend watching it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
