jscheffl commented on code in PR #38481:
URL: https://github.com/apache/airflow/pull/38481#discussion_r1542118514
##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -224,6 +224,29 @@ If one dataset is updated multiple times before all consumed datasets have been
}
+Attaching extra information to an emitting Dataset Event
+--------------------------------------------------------
+
+.. versionadded:: 2.10.0
+
+A task with a dataset outlet can optionally attach extra information before it emits a dataset event. This is different
+from `Extra information on Dataset`_. Extra information on a dataset statically describes the entity pointed to by the dataset URI; extra information on the *dataset event* instead should be used to annotate the triggering data change, such as how many rows in the database are changed by the update, or the date range covered by it.
+
+The easiest way to attach extra information to the dataset event is by accessing ``dataset_events`` in a task's execution context:
+
+.. code-block:: python
+
+ example_s3_dataset = Dataset("s3://dataset/example.csv")
+
+
+ @task(outlets=[example_s3_dataset])
+ def write_to_s3(*, dataset_events):
+ df = ... # Get a Pandas DataFrame to write.
+ # Write df to dataset...
+ dataset_events[example_s3_dataset].extras = {"row_count": len(df)}
+
+This can also be done in classic operators by either subclassing the operator and overriding ``execute``, or by supplying a pre- or post-execution function.
Review Comment:
Are you sure that attaching extra information is also possible in the `post_execution` hook?
Are you sure this is not executed after the event has already been emitted?
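To make the ordering concern concrete, here is a minimal, self-contained sketch (no Airflow imports; `DatasetEventAccessor`, `run_task_with_hooks`, and the snapshot step are hypothetical stand-ins mirroring the shape of this PR, not the real worker code):

```python
from collections import defaultdict


class DatasetEventAccessor:
    """Hypothetical stand-in for the per-dataset accessor added in this PR."""

    def __init__(self):
        self.extras = {}


def run_task_with_hooks(execute, post_execute, context):
    """Sketch of one possible ordering: execute, then post_execute, then
    event emission. Extras set in post_execute are only picked up if the
    real TaskInstance emits dataset events *after* the hook runs -- which
    is exactly the question raised above."""
    context["dataset_events"] = defaultdict(DatasetEventAccessor)
    result = execute(context)
    post_execute(context, result)
    # Pretend emission happens here: snapshot whatever extras were attached.
    return {uri: acc.extras for uri, acc in context["dataset_events"].items()}
```

If emission instead happened before `post_execute`, extras set in the hook would be silently lost, so the docs should only advertise the hook if the ordering is confirmed.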
##########
airflow/models/taskinstance.py:
##########
@@ -761,6 +767,7 @@ def get_triggering_events() -> dict[str, list[DatasetEvent | DatasetEventPydanti
"dag_run": dag_run,
"data_interval_end": timezone.coerce_datetime(data_interval.end),
"data_interval_start": timezone.coerce_datetime(data_interval.start),
+ "dataset_events": DatasetEventAccessors(),
Review Comment:
I was searching our codebase but could not find good documentation about the content of `context` as published to users. Would it now (with this PR) be time to document this for users?
I first thought of docs/apache-airflow/templates-ref.rst, but that is only for templating; content-wise it overlaps a lot with the content of the context.
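Agreed that the context keys deserve user-facing docs. For illustration, the way a task function receives only the context entries its signature asks for can be approximated like this (a simplified sketch; `call_with_context` is a hypothetical toy, and Airflow's real resolution is more involved, e.g. it also honors `**kwargs`, which this toy ignores):

```python
import inspect


def call_with_context(task_fn, context):
    """Toy version of context injection: pass only the context keys that
    appear in the task function's signature. This is why an authoritative
    list of available keys (dag_run, data_interval_start, dataset_events,
    ...) would help task authors."""
    accepted = inspect.signature(task_fn).parameters
    return task_fn(**{k: v for k, v in context.items() if k in accepted})
```

A task asking for `dataset_events` alone would then get just that key, without the rest of the context.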
##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -99,8 +99,8 @@ The identifier does not have to be absolute; it can be a scheme-less, relative U
Non-absolute identifiers are considered plain strings that do not carry any semantic meanings to Airflow.
-Extra information
------------------
+Extra information on Dataset
+----------------------------
Review Comment:
THANKS for this clarification!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]