bolkedebruin commented on code in PR #29433:
URL: https://github.com/apache/airflow/pull/29433#discussion_r1111675033
##########
airflow/datasets/manager.py:
##########
@@ -55,23 +61,33 @@ def register_dataset_change(
dataset_model = session.query(DatasetModel).filter(DatasetModel.uri == dataset.uri).one_or_none()
if not dataset_model:
self.log.warning("DatasetModel %s not found", dataset)
- return
- session.add(
- DatasetEvent(
+ return None
+
+ if task_instance:
+ dataset_event = DatasetEvent(
dataset_id=dataset_model.id,
source_task_id=task_instance.task_id,
source_dag_id=task_instance.dag_id,
source_run_id=task_instance.run_id,
source_map_index=task_instance.map_index,
extra=extra,
)
- )
+ else:
+ # When an external dataset change is made through the API, it isn't triggered by a task instance,
+ # so we create a DatasetEvent without the task and dag data.
+ dataset_event = DatasetEvent(
Review Comment:
It would be great to have extra information available when the dataset is
changed externally, such as:
* by whom: `external_auth_id` or `external_service_id` (required)
* from where (API, client_ip / remote_addr): `external_source` (required)
* the timestamp of the actual event, so it can be reconciled if needed
(nullable, as it might not be available)

This ensures lineage isn't broken across systems.
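
A minimal sketch of how these fields could be modeled, assuming a plain dataclass rather than the actual `DatasetEvent` model. The class name `ExternalChangeInfo`, the field name `event_timestamp`, and the sample values are hypothetical; only `external_auth_id`, `external_service_id`, and `external_source` come from the suggestion above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ExternalChangeInfo:
    """Hypothetical provenance record for an externally triggered dataset change."""

    # Required: where the change came from (e.g. the API, client_ip / remote_addr).
    external_source: str
    # Who made the change: at least one of these two must be set.
    external_auth_id: Optional[str] = None
    external_service_id: Optional[str] = None
    # Nullable: the time of the actual event may not be available.
    event_timestamp: Optional[datetime] = None

    def __post_init__(self) -> None:
        if not self.external_source:
            raise ValueError("external_source is required")
        if not (self.external_auth_id or self.external_service_id):
            raise ValueError("external_auth_id or external_service_id is required")


# Example: a change reported by another service through the API.
info = ExternalChangeInfo(
    external_source="api",
    external_service_id="upstream-scheduler",
    event_timestamp=datetime(2023, 2, 20, 12, 0, tzinfo=timezone.utc),
)
```

Validating at construction time keeps the "required" fields from being silently dropped, while the timestamp stays optional so callers that cannot supply it are not blocked.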