dstandish commented on code in PR #24743:
URL: https://github.com/apache/airflow/pull/24743#discussion_r911628658
##########
airflow/models/dagrun.py:
##########
@@ -631,6 +631,32 @@ def update_state(
         session.merge(self)
         # We do not flush here for performance reasons(It increases queries count by +20)
+        from airflow.models import Dataset
+        from airflow.models.dataset_dag_run_event import DatasetDagRunEvent as DDRE
+        from airflow.models.serialized_dag import SerializedDagModel
+
+        datasets = []
+        for task in self.dag.tasks:
+            for outlet in getattr(task, '_outlets', []):
+                if isinstance(outlet, Dataset):
+                    datasets.append(outlet)
+        dataset_ids = [x.get_dataset_id(session=session) for x in datasets]
+        events_to_process = session.query(DDRE).filter(DDRE.dataset_id.in_(dataset_ids)).all()
Review Comment:
Ok yes this makes sense for a single upstream. But we'd have to change this
query / process substantially to handle multiple upstreams. And for multiple
upstreams, if the logic lives in `update_state`, I don't think we can avoid the
possibility of two different dag runs (not necessarily of the same dag) trying
to create the same dag run.
Here's an example:
dag1 -> dataset1
dag2 -> dataset2
dataset1 -> dag3
dataset2 -> dag3
So if the dag runs for dag 1 and dag 2 are being handled by different
schedulers, they could both try to create a run of dag 3. And I'm not sure the
run id will help in that scenario.
I can think of a few solutions:
* make sure `create_dagrun` does essentially `insert ignore` or `on conflict
do nothing`, so that if the dag run has already been created it just does nothing.
I believe some version of this is supported on all of our databases.
* do the dagrun creation in a commit isolated from any other operation, so
that if it fails we can simply catch and ignore.
* add a new top level scheduler query that will partition dags using the
same kind of skip locked logic of the main scheduler query and create
necessary dag runs
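To illustrate the first option, here's a minimal sketch of idempotent dag-run creation using SQLAlchemy's `on_conflict_do_nothing()`. The table and `create_dagrun_if_absent` helper are illustrative stand-ins, not Airflow's real schema or API; the sketch uses the sqlite dialect, but the postgresql dialect exposes the same construct.

```python
import sqlalchemy as sa
from sqlalchemy.dialects.sqlite import insert  # postgresql dialect has the same API

engine = sa.create_engine("sqlite://")
meta = sa.MetaData()
# Illustrative stand-in for the dag_run table, not Airflow's real schema
dag_run = sa.Table(
    "dag_run",
    meta,
    sa.Column("dag_id", sa.String, primary_key=True),
    sa.Column("run_id", sa.String, primary_key=True),
)
meta.create_all(engine)


def create_dagrun_if_absent(conn, dag_id, run_id):
    # ON CONFLICT DO NOTHING: if another scheduler already inserted this
    # (dag_id, run_id), the statement is silently a no-op instead of raising
    stmt = insert(dag_run).values(dag_id=dag_id, run_id=run_id).on_conflict_do_nothing()
    conn.execute(stmt)


with engine.begin() as conn:
    # two schedulers racing to create the same run of dag3
    create_dagrun_if_absent(conn, "dag3", "dataset_triggered__1")
    create_dagrun_if_absent(conn, "dag3", "dataset_triggered__1")
    total = conn.execute(sa.select(sa.func.count()).select_from(dag_run)).scalar()
```

The second call hits the primary-key conflict and is skipped, so only one row exists afterwards, which is exactly the "do nothing if already created" behavior described above.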
In any case, essentially what we have to do is: for each dataset updated by
this dag, for each dag downstream of that dataset, check whether all of its
upstream datasets have been updated, and if so create a dag run.
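That check could be sketched roughly as follows. The data structures and the `dags_ready_to_trigger` helper are hypothetical stand-ins for the PR's actual models and queries:

```python
def dags_ready_to_trigger(updated_dataset_ids, upstream_map, recorded_events):
    """Return dag ids whose upstream datasets have all been updated.

    upstream_map: dag_id -> set of dataset ids that dag depends on
    recorded_events: set of (dag_id, dataset_id) pairs already recorded
    """
    ready = []
    for dag_id, upstreams in upstream_map.items():
        # only look at dags downstream of a dataset this run just updated
        if not upstreams & set(updated_dataset_ids):
            continue
        # create a dag run only once *all* upstream datasets have an event
        if all((dag_id, ds) in recorded_events for ds in upstreams):
            ready.append(dag_id)
    return ready


# the dag1 / dag2 / dag3 example from above:
upstream_map = {"dag3": {"dataset1", "dataset2"}}
events = {("dag3", "dataset1")}            # dag1 has finished
after_dag1 = dags_ready_to_trigger(["dataset1"], upstream_map, events)
events.add(("dag3", "dataset2"))           # now dag2 finishes too
after_dag2 = dags_ready_to_trigger(["dataset2"], upstream_map, events)
```

With only dataset1 updated, dag3 is not ready; once dataset2's event lands, whichever scheduler processes it would see dag3 as ready to run.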
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]