dstandish commented on code in PR #24743:
URL: https://github.com/apache/airflow/pull/24743#discussion_r913084243
##########
airflow/models/dagrun.py:
##########
@@ -631,6 +631,32 @@ def update_state(
         session.merge(self)
         # We do not flush here for performance reasons(It increases queries count by +20)
+        from airflow.models import Dataset
+        from airflow.models.dataset_dag_run_event import DatasetDagRunEvent as DDRE
+        from airflow.models.serialized_dag import SerializedDagModel
+
+        datasets = []
+        for task in self.dag.tasks:
+            for outlet in getattr(task, '_outlets', []):
+                if isinstance(outlet, Dataset):
+                    datasets.append(outlet)
+        dataset_ids = [x.get_dataset_id(session=session) for x in datasets]
+        events_to_process = session.query(DDRE).filter(DDRE.dataset_id.in_(dataset_ids)).all()
Review Comment:
> I was hoping to explicitly not have to do this (as mentioned in the thread I linked.)

I know, just listing out the options I see.

> As currently "specified" that would result in two runs being triggered, not one, so this is not a problem for now.

By "specified" do you mean how the PR is currently structured? Yeah, I guess that's true: we can't get a "new dagrun" conflict since we're essentially triggering a "manual" run, which should create a unique run id by timestamp. So to prevent "duplicate" dag runs we'd have to generate the run id deterministically, but then we introduce the conflict problem.
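
To make the tradeoff concrete, here is a minimal sketch of the two run-id strategies. This is not Airflow's actual API; the function names and the `event_key` parameter are hypothetical.

    from datetime import datetime, timezone

    def timestamp_run_id() -> str:
        # Unique per call, so concurrent triggers never collide on the
        # DagRun unique constraint; but two events for the same dataset
        # produce two "duplicate" dag runs.
        return f"dataset_triggered__{datetime.now(timezone.utc).isoformat()}"

    def deterministic_run_id(dataset_uri: str, event_key: str) -> str:
        # Same inputs always yield the same run_id, so duplicate events
        # collapse into one run; but two schedulers handling the same event
        # concurrently will both try to insert this run_id, and one insert
        # fails the unique constraint (the conflict problem above).
        return f"dataset_triggered__{dataset_uri}__{event_key}"

Deduplication and conflict-freedom pull in opposite directions here, which is why neither option is obviously right.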