jedcunningham commented on code in PR #40335:
URL: https://github.com/apache/airflow/pull/40335#discussion_r1678038196

##########
docs/apache-airflow/administration-and-deployment/lineage.rst:
##########
@@ -89,6 +89,48 @@ has outlets defined (e.g. by using ``add_outlets(..)`` or has out of the box sup
 .. _precedence: https://docs.python.org/3/reference/expressions.html
+Hook Lineage
+------------
+
+Airflow provides a powerful feature for tracking data lineage not only between tasks but also from hooks used within those tasks.
+This functionality helps you understand how data flows throughout your Airflow pipelines.
+
+A global instance of ``HookLineageCollector`` serves as the central hub for collecting lineage information.
+Hooks can send details about datasets they interact with to this collector.
+The collector then uses this data to construct AIP-60 compliant Datasets, a standard format for describing datasets.
+
+.. code-block:: python
+
+    from airflow.lineage.hook_lineage import get_hook_lineage_collector
+
+
+    class CustomHook(BaseHook):
+        def run(self):
+            # run actual code
+            collector = get_hook_lineage_collector()
+            collector.add_input_dataset(self, dataset_kwargs={"scheme": "file", "path": "/tmp/in"})
+            collector.add_output_dataset(self, dataset_kwargs={"scheme": "file", "path": "/tmp/out"})
+
+Lineage data collected by the ``HookLineageCollector`` can be accessed using an instance of ``HookLineageReader``.

Review Comment:
   ```suggestion
   Lineage data collected by the ``HookLineageCollector`` can be accessed using an instance of ``HookLineageReader``, which is registered in an Airflow plugin.
   ```

   Or similar. Let's state it vs it just being in the example code.

##########
docs/apache-airflow/administration-and-deployment/lineage.rst:
##########
@@ -89,6 +89,42 @@ has outlets defined (e.g. by using ``add_outlets(..)`` or has out of the box sup
 .. _precedence: https://docs.python.org/3/reference/expressions.html
+Hook Lineage
+------------
+
+Airflow provides a powerful feature for tracking data lineage not only between tasks but also from hooks used within those tasks.
+This functionality helps you understand how data flows throughout your Airflow pipelines.
+
+A global instance of ``HookLineageCollector`` serves as the central hub for collecting lineage information.
+Hooks can send details about datasets they interact with to this collector.
+The collector then uses this data to construct AIP-60 compliant Datasets, a standard format for describing datasets.
+
+.. code-block:: python
+
+    from airflow.lineage.hook_lineage import get_hook_lineage_collector
+
+
+    class CustomHook(BaseHook):
+        def run(self):
+            # run actual code
+            collector = get_hook_lineage_collector()
+            collector.add_inlet(dataset_kwargs={"scheme": "file", "path": "/tmp/in"}, self)
+            collector.add_outlet(dataset_kwargs={"scheme": "file", "path": "/tmp/out"}, self)
+
+Lineage data collected by the ``HookLineageCollector`` can be accessed using an instance of ``HookLineageReader``.
+
+.. code-block:: python
+
+    from airflow.lineage.hook_lineage import HookLineageReader
+
+
+    class CustomHookLineageReader(HookLineageReader):
+        def get_inputs(self):
+            return self.lineage_collector.collected_datasets.inputs
+
+If no ``HookLineageReader`` is registered within Airflow, a default ``NoOpCollector`` is used instead.

Review Comment:
   Cool, plugins feel like a better home.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
