error418 commented on issue #35297:
URL: https://github.com/apache/airflow/issues/35297#issuecomment-1839168788
Hi @tokoko,
`register_dataset_change` never seems to be called with its `extra`
parameter. It would be good to be able to store extra data on the event and
read it back from there.
It seems the primary key of the `Dataset` entity is `uri`, which does not
allow multiple Datasets that share a URI but differ in their `extra`
configuration.
https://github.com/apache/airflow/blob/35a1b7a63a7e9eab299955e0b35f2fd3614b22ee/airflow/datasets/manager.py#L66
`DatasetEvent` would therefore be the only available place to store `Dataset`
`extra` data emitted from tasks.
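To illustrate the point about the `uri` key, here is a minimal sketch. It does not use Airflow's models: `DatasetRow`, `DatasetEventRow`, `store_dataset` and `store_event` are hypothetical stand-ins, and letting the first registration win is an arbitrary choice made only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRow:
    """Hypothetical stand-in for a stored dataset; `uri` acts as the key."""
    uri: str
    extra: dict = field(default_factory=dict)


@dataclass
class DatasetEventRow:
    """Hypothetical stand-in for a stored event; every event has its own `extra`."""
    dataset_uri: str
    extra: dict = field(default_factory=dict)


datasets = {}  # at most one DatasetRow per uri
events = []    # any number of DatasetEventRow entries per uri


def store_dataset(uri, extra):
    # A second call with the same uri cannot add a second row.
    # Assumption for this sketch only: the first row is kept.
    return datasets.setdefault(uri, DatasetRow(uri, extra))


def store_event(uri, extra):
    # Events are appended, so each one can carry different extra data.
    event = DatasetEventRow(uri, extra)
    events.append(event)
    return event


first = store_dataset("s3://bucket/key", {"rows": 10})
second = store_dataset("s3://bucket/key", {"rows": 20})
store_event("s3://bucket/key", {"rows": 10})
store_event("s3://bucket/key", {"rows": 20})
```

In this sketch the second `extra` is lost at the dataset level but survives on the event, which is why the event looks like the natural place for per-run data.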
As mentioned in the issue, the problem lies in the omitted `extra`
argument in `taskinstance.py`, which is responsible for populating the `extra`
property of `DatasetEvent`:
https://github.com/apache/airflow/blob/55b015f995def3bc8a3a9eef6abd7bcad49888f7/airflow/models/taskinstance.py#L2342-L2346
A fix might look like this:
```python
def _register_dataset_changes(self, *, session: Session) -> None:
    for obj in self.task.outlets or []:
        self.log.debug("outlet obj %s", obj)
        # Lineage can have other types of objects besides datasets
        if isinstance(obj, Dataset):
            dataset_manager.register_dataset_change(
                task_instance=self,
                dataset=obj,
                session=session,
                extra=obj.extra,
            )
```
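The effect of the proposed change can be checked without a running Airflow instance. The following is only a sketch: `Dataset` here is a reduced stand-in, not Airflow's class, `RecordingManager` and `register_dataset_changes` are hypothetical helpers, and the fake manager's signature is modeled loosely on the call in the snippet above.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Reduced stand-in for Airflow's Dataset (only the fields used here)."""
    uri: str
    extra: dict = field(default_factory=dict)


class RecordingManager:
    """Hypothetical fake manager that records what it is called with."""

    def __init__(self):
        self.calls = []

    def register_dataset_change(self, *, task_instance, dataset, extra=None, session=None):
        self.calls.append({"dataset": dataset, "extra": extra})


def register_dataset_changes(outlets, manager, task_instance=None, session=None):
    # Same loop as the proposed fix, with the manager passed in explicitly.
    for obj in outlets or []:
        # Outlets may contain objects that are not datasets; skip those.
        if isinstance(obj, Dataset):
            manager.register_dataset_change(
                task_instance=task_instance,
                dataset=obj,
                session=session,
                extra=obj.extra,
            )


manager = RecordingManager()
outlets = [Dataset("s3://bucket/key", {"rows": 10}), "not-a-dataset"]
register_dataset_changes(outlets, manager)
```

After the call, `manager.calls` holds a single entry whose `extra` is the dict attached to the outlet dataset; without the `extra=obj.extra` argument the fake manager would record `None` instead.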