ashb commented on code in PR #24743:
URL: https://github.com/apache/airflow/pull/24743#discussion_r915711168
##########
airflow/utils/types.py:
##########
@@ -45,6 +45,7 @@ class DagRunType(str, enum.Enum):
BACKFILL_JOB = "backfill"
SCHEDULED = "scheduled"
MANUAL = "manual"
+ DATASE = "dataset"
Review Comment:
a)
```suggestion
DATASET = "dataset"
```
b) also, this isn't quite the right name (it doesn't capture the meaning of
what is actually happening). How about one of these:
```python
DATASET_TRIGGERED = "dataset_triggered"
TRIGGERED = "triggered"
DATASET_SCHEDULED = "dataset_scheduled"
```
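To illustrate why the member name matters for readers of the enum: a minimal sketch (not the actual Airflow class) of a str-valued enum like `DagRunType`, using one of the proposed names. Since the enum subclasses `str`, members compare equal to their string values, so the name chosen here shows up anywhere run types are compared or logged.

```python
import enum

class DagRunType(str, enum.Enum):
    # Sketch mirroring the members in the diff; DATASET_TRIGGERED is one of
    # the hypothetical names proposed above, not (yet) the real member name.
    BACKFILL_JOB = "backfill"
    SCHEDULED = "scheduled"
    MANUAL = "manual"
    DATASET_TRIGGERED = "dataset_triggered"

# str subclassing means members compare equal to plain strings.
assert DagRunType.DATASET_TRIGGERED == "dataset_triggered"
assert DagRunType.MANUAL.value == "manual"
```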
##########
airflow/models/taskinstance.py:
##########
@@ -1512,9 +1513,24 @@ def _run_raw_task(
if not test_mode:
session.add(Log(self.state, self))
session.merge(self)
-
+ self._create_dataset_dag_run_queue_records(session=session)
session.commit()
+ def _create_dataset_dag_run_queue_records(self, *, session=NEW_SESSION):
Review Comment:
```suggestion
def _create_dataset_dag_run_queue_records(self, *, session):
```
(since it is a required value)
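The difference the suggestion is getting at can be sketched in isolation: with a default, a caller can silently omit `session`; as a required keyword-only argument, omitting it fails loudly. (`Session` below is a stand-in placeholder, not the real Airflow session type.)

```python
class Session:
    """Placeholder for a database session object."""

def with_default(*, session=None):
    # Caller may omit session; a missing session slips through silently.
    return session

def required(*, session):
    # Caller must pass session explicitly; omitting it raises TypeError.
    return session

s = Session()
assert with_default() is None          # omission goes unnoticed
assert required(session=s) is s
try:
    required()                         # missing required keyword
except TypeError:
    pass                               # fails fast, as intended
```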
##########
airflow/models/taskinstance.py:
##########
@@ -1512,9 +1513,24 @@ def _run_raw_task(
if not test_mode:
session.add(Log(self.state, self))
session.merge(self)
-
+ self._create_dataset_dag_run_queue_records(session=session)
session.commit()
+ def _create_dataset_dag_run_queue_records(self, *, session=NEW_SESSION):
+ from airflow.models import Dataset
+
+ for obj in getattr(self.task, '_outlets', []):
+ self.log.debug("outlet obj %s", obj)
+ if isinstance(obj, Dataset):
+ dataset = session.query(Dataset).filter(Dataset.uri == obj.uri).first()
Review Comment:
```suggestion
dataset = session.query(Dataset).filter(Dataset.uri == obj.uri).one_or_none()
```
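The semantic difference, sketched with plain-Python stand-ins (not SQLAlchemy itself): `.first()` silently returns the first of several matching rows, whereas `.one_or_none()` raises when more than one row matches, surfacing duplicate-URI bugs early.

```python
def first(rows):
    # Stand-in for Query.first(): first row, or None if there are none.
    return rows[0] if rows else None

def one_or_none(rows):
    # Stand-in for Query.one_or_none(): raises if more than one row matches.
    if len(rows) > 1:
        raise ValueError("Multiple rows were found when one or none was required")
    return rows[0] if rows else None

assert first([]) is None and one_or_none([]) is None
assert first(["a", "b"]) == "a"   # duplicate row goes unnoticed
try:
    one_or_none(["a", "b"])       # duplicate row raises
except ValueError:
    pass
```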
##########
airflow/serialization/schema.json:
##########
@@ -90,6 +90,7 @@
{ "$ref": "#/definitions/typed_relativedelta" }
]
},
+ "schedule_on": { "type": "array" },
Review Comment:
Could you create a dataset definition (a dict with `uri` and `extra` properties) and then:
```suggestion
"schedule_on": {
"type": "array",
"items": { "$ref": "#/definitions/dataset" }
},
```
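A hypothetical shape for that `dataset` definition, with a minimal structural check written in plain Python (the real file uses JSON Schema, and the exact property names are an assumption based on the comment above):

```python
# Assumed shape of the suggested "#/definitions/dataset" entry.
dataset_def = {
    "type": "object",
    "properties": {
        "uri": {"type": "string"},
        "extra": {"type": "object"},
    },
    "required": ["uri"],
}

def valid_schedule_on(items):
    # Mirrors the intent of the suggested schema: schedule_on must be a list
    # of dicts, each carrying at least the required "uri" key.
    return all(isinstance(i, dict) and "uri" in i for i in items)

assert valid_schedule_on([{"uri": "s3://bucket/file", "extra": {}}])
assert not valid_schedule_on([{"extra": {}}])   # missing required "uri"
```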
##########
tests/models/test_dagrun.py:
##########
@@ -1291,6 +1309,38 @@ def consumer(*args):
assert dr.state == DagRunState.FAILED
+def test_dataset_dagruns_triggered(session):
+ from airflow.example_dags import example_datasets
+ from airflow.example_dags.example_datasets import dag1
+
+ session = settings.Session()
+ dagbag = DagBag(dag_folder=example_datasets.__file__)
+ dagbag.collect_dags(only_if_updated=False, safe_mode=False)
+ dagbag.sync_to_db(session=session)
+ run_id = str(uuid4())
+ dr = DagRun(dag1.dag_id, run_id=run_id, run_type='anything')
+ dr.dag = dag1
+ session.add(dr)
+ session.commit()
+ assert dr.id is not None
+ task = dag1.get_task('upstream_task_1')
+ task.bash_command = 'echo 1' # make it go faster
+ ti = TaskInstance(task, run_id=run_id)
+ session.merge(ti)
+ session.commit()
+ ti._run_raw_task()
+ ti.refresh_from_db()
+ assert ti.state == State.SUCCESS
+ assert session.query(DatasetDagRunQueue.target_dag_id).filter(
+ DatasetTaskRef.dag_id == dag1.dag_id, DatasetTaskRef.task_id == 'upstream_task_1'
+ ).all() == [('dag3',), ('dag4',), ('dag5',)]
Review Comment:
This is "double testing" the behaviour of
`TI._create_dataset_dag_run_queue_records` (that is already tested via
`test_outlet_datasets`).
Could we instead please have this test insert the queue rows it wants
directly?
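What "insert the queue rows directly" could look like, sketched with stdlib `sqlite3` (the table and column names here are illustrative, not Airflow's real schema): the test seeds the queue itself instead of re-running `TI._create_dataset_dag_run_queue_records`, so only the downstream behaviour is under test.

```python
import sqlite3

# In-memory database standing in for the Airflow metadata DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_dag_run_queue (dataset_id INT, target_dag_id TEXT)"
)
# Seed the rows the test wants directly, rather than running a task to
# produce them as a side effect.
conn.executemany(
    "INSERT INTO dataset_dag_run_queue VALUES (?, ?)",
    [(1, "dag3"), (1, "dag4"), (1, "dag5")],
)
rows = conn.execute(
    "SELECT target_dag_id FROM dataset_dag_run_queue ORDER BY target_dag_id"
).fetchall()
assert rows == [("dag3",), ("dag4",), ("dag5",)]
```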