ashb commented on code in PR #24743:
URL: https://github.com/apache/airflow/pull/24743#discussion_r915711168
##########
airflow/utils/types.py:
##########
@@ -45,6 +45,7 @@ class DagRunType(str, enum.Enum):
BACKFILL_JOB = "backfill"
SCHEDULED = "scheduled"
MANUAL = "manual"
+ DATASE = "dataset"
Review Comment:
a)
```suggestion
DATASET = "dataset"
```
b) also, this isn't quite the right name (it doesn't capture the meaning of
what is actually happening). How about one of these:
```python
DATASET_TRIGGERED = "dataset_triggered"
TRIGGERED = "triggered"
DATASET_SCHEDULED = "dataset_scheduled"
```
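To illustrate why the member name matters for readers of the enum: a minimal sketch (not the actual Airflow class) of a str-valued enum like `DagRunType`, using one of the proposed names. Since the enum subclasses `str`, members compare equal to their string values, so the name chosen here shows up anywhere run types are compared or logged.

```python
import enum

class DagRunType(str, enum.Enum):
    # Sketch mirroring the members in the diff; DATASET_TRIGGERED is one of
    # the hypothetical names proposed above, not (yet) the real member name.
    BACKFILL_JOB = "backfill"
    SCHEDULED = "scheduled"
    MANUAL = "manual"
    DATASET_TRIGGERED = "dataset_triggered"

# str subclassing means members compare equal to plain strings.
assert DagRunType.DATASET_TRIGGERED == "dataset_triggered"
assert DagRunType.MANUAL.value == "manual"
```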
##########
airflow/models/taskinstance.py:
##########
@@ -1512,9 +1513,24 @@ def _run_raw_task(
if not test_mode:
session.add(Log(self.state, self))
session.merge(self)
-
+ self._create_dataset_dag_run_queue_records(session=session)
session.commit()
+ def _create_dataset_dag_run_queue_records(self, *, session=NEW_SESSION):
Review Comment:
```suggestion
def _create_dataset_dag_run_queue_records(self, *, session):
```
(since it is a required value)
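The difference the suggestion is getting at can be sketched in isolation: with a default, a caller can silently omit `session`; as a required keyword-only argument, omitting it fails loudly. (`Session` below is a stand-in placeholder, not the real Airflow session type.)

```python
class Session:
    """Placeholder for a database session object."""

def with_default(*, session=None):
    # Caller may omit session; a missing session slips through silently.
    return session

def required(*, session):
    # Caller must pass session explicitly; omitting it raises TypeError.
    return session

s = Session()
assert with_default() is None          # omission goes unnoticed
assert required(session=s) is s
try:
    required()                         # missing required keyword
except TypeError:
    pass                               # fails fast, as intended
```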
##########
airflow/models/taskinstance.py:
##########
@@ -1512,9 +1513,24 @@ def _run_raw_task(
if not test_mode:
session.add(Log(self.state, self))
session.merge(self)
-
+ self._create_dataset_dag_run_queue_records(session=session)
session.commit()
+ def _create_dataset_dag_run_queue_records(self, *, session=NEW_SESSION):
+ from airflow.models import Dataset
+
+ for obj in getattr(self.task, '_outlets', []):
+ self.log.debug("outlet obj %s", obj)
+ if isinstance(obj, Dataset):
+ dataset = session.query(Dataset).filter(Dataset.uri == obj.uri).first()
Review Comment:
```suggestion
dataset = session.query(Dataset).filter(Dataset.uri == obj.uri).one_or_none()
```
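The semantic difference, sketched with plain-Python stand-ins (not SQLAlchemy itself): `.first()` silently returns the first of several matching rows, whereas `.one_or_none()` raises when more than one row matches, surfacing duplicate-URI bugs early.

```python
def first(rows):
    # Stand-in for Query.first(): first row, or None if there are none.
    return rows[0] if rows else None

def one_or_none(rows):
    # Stand-in for Query.one_or_none(): raises if more than one row matches.
    if len(rows) > 1:
        raise ValueError("Multiple rows were found when one or none was required")
    return rows[0] if rows else None

assert first([]) is None and one_or_none([]) is None
assert first(["a", "b"]) == "a"   # duplicate row goes unnoticed
try:
    one_or_none(["a", "b"])       # duplicate row raises
except ValueError:
    pass
```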
##########
airflow/serialization/schema.json:
##########
@@ -90,6 +90,7 @@
{ "$ref": "#/definitions/typed_relativedelta" }
]
},
+ "schedule_on": { "type": "array" },
Review Comment:
Could you create a dataset definition (a dict with `uri` and `extra` properties) and then:
```suggestion
"schedule_on": {
"type": "array",
"items": { "$ref": "#/definitions/dataset" }
},
```
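A hypothetical shape for that `dataset` definition, with a minimal structural check written in plain Python (the real file uses JSON Schema, and the exact property names are an assumption based on the comment above):

```python
# Assumed shape of the suggested "#/definitions/dataset" entry.
dataset_def = {
    "type": "object",
    "properties": {
        "uri": {"type": "string"},
        "extra": {"type": "object"},
    },
    "required": ["uri"],
}

def valid_schedule_on(items):
    # Mirrors the intent of the suggested schema: schedule_on must be a list
    # of dicts, each carrying at least the required "uri" key.
    return all(isinstance(i, dict) and "uri" in i for i in items)

assert valid_schedule_on([{"uri": "s3://bucket/file", "extra": {}}])
assert not valid_schedule_on([{"extra": {}}])   # missing required "uri"
```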
##########
tests/models/test_dagrun.py:
##########
@@ -1291,6 +1309,38 @@ def consumer(*args):
assert dr.state == DagRunState.FAILED
+def test_dataset_dagruns_triggered(session):
+ from airflow.example_dags import example_datasets
+ from airflow.example_dags.example_datasets import dag1
+
+ session = settings.Session()
+ dagbag = DagBag(dag_folder=example_datasets.__file__)
+ dagbag.collect_dags(only_if_updated=False, safe_mode=False)
+ dagbag.sync_to_db(session=session)
+ run_id = str(uuid4())
+ dr = DagRun(dag1.dag_id, run_id=run_id, run_type='anything')
+ dr.dag = dag1
+ session.add(dr)
+ session.commit()
+ assert dr.id is not None
+ task = dag1.get_task('upstream_task_1')
+ task.bash_command = 'echo 1' # make it go faster
+ ti = TaskInstance(task, run_id=run_id)
+ session.merge(ti)
+ session.commit()
+ ti._run_raw_task()
+ ti.refresh_from_db()
+ assert ti.state == State.SUCCESS
+ assert session.query(DatasetDagRunQueue.target_dag_id).filter(
+ DatasetTaskRef.dag_id == dag1.dag_id, DatasetTaskRef.task_id == 'upstream_task_1'
+ ).all() == [('dag3',), ('dag4',), ('dag5',)]
Review Comment:
This is "double testing" the behaviour of
`TI._create_dataset_dag_run_queue_records` (that is already tested via
`test_outlet_datasets`).
Could we instead please have this test insert the queue rows it wants
directly?
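What "insert the queue rows directly" could look like, sketched with stdlib `sqlite3` (the table and column names here are illustrative, not Airflow's real schema): the test seeds the queue itself instead of re-running `TI._create_dataset_dag_run_queue_records`, so only the downstream behaviour is under test.

```python
import sqlite3

# In-memory database standing in for the Airflow metadata DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_dag_run_queue (dataset_id INT, target_dag_id TEXT)"
)
# Seed the rows the test wants directly, rather than running a task to
# produce them as a side effect.
conn.executemany(
    "INSERT INTO dataset_dag_run_queue VALUES (?, ?)",
    [(1, "dag3"), (1, "dag4"), (1, "dag5")],
)
rows = conn.execute(
    "SELECT target_dag_id FROM dataset_dag_run_queue ORDER BY target_dag_id"
).fetchall()
assert rows == [("dag3",), ("dag4",), ("dag5",)]
```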