JDarDagran commented on code in PR #37620:
URL: https://github.com/apache/airflow/pull/37620#discussion_r1501787601
##########
airflow/providers/openlineage/provider.yaml:
##########
@@ -58,65 +58,67 @@ config:
openlineage:
description: |
This section applies settings for OpenLineage integration.
- For backwards compatibility with `openlineage-python` one can still use
- `openlineage.yml` file or `OPENLINEAGE_` environment variables. However,
below
- configuration takes precedence over those.
- More in documentation -
https://openlineage.io/docs/client/python#configuration.
+ More about configuration and it's precedence can be found at
+
https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html#transport-setup
options:
disabled:
description: |
- Set this to true if you don't want OpenLineage to emit events.
+ Disable sending events without uninstalling the OpenLineage Provider
by setting this to true.
type: boolean
example: ~
default: "False"
version_added: ~
disabled_for_operators:
description: |
- Semicolon separated string of Airflow Operator names to disable
+ Exclude some Operators from emitting OpenLineage events by passing a
string of semicolon separated
+ full import paths of Operators to disable.
type: string
example:
"airflow.operators.bash.BashOperator;airflow.operators.python.PythonOperator"
default: ""
version_added: 1.1.0
namespace:
description: |
- OpenLineage namespace
+ Set namespace that the lineage data belongs to, so that if you use
multiple OpenLineage producers,
+ events coming from them will be logically separated.
version_added: ~
type: string
- example: "food_delivery"
+ example: "my_airflow_instance_1"
default: ~
extractors:
description: |
- Semicolon separated paths to custom OpenLineage extractors.
+ Register custom OpenLineage Extractors by passing a string of
semicolon separated full import paths.
type: string
example: full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass
default: ~
version_added: ~
config_path:
description: |
- Path to YAML config. This provides backwards compatibility to pass
config as
+ Provide path to YAML config file. This provides backwards
compatibility to pass config as
`openlineage.yml` file.
Review Comment:
```suggestion
Specify the path to the YAML configuration file.
This ensures backwards compatibility with passing config through
the `openlineage.yml` file.
```
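As an aside on the semicolon-separated options in this hunk
(`disabled_for_operators`, `extractors`): a minimal sketch of how such a
value could be split into full import paths. `split_import_paths` is a
hypothetical helper for illustration only; it is an assumption and not the
provider's actual parsing code.

```python
def split_import_paths(raw: str) -> list[str]:
    """Split a semicolon-separated string into stripped, non-empty import paths.

    Hypothetical helper: the real provider may parse this option differently.
    """
    return [part.strip() for part in raw.split(";") if part.strip()]


# Example value taken from the `disabled_for_operators` option in the diff.
example = "airflow.operators.bash.BashOperator;airflow.operators.python.PythonOperator"
paths = split_import_paths(example)
```

With the default empty string, this sketch yields an empty list, so no
Operators would be excluded.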
##########
docs/apache-airflow-providers-openlineage/guides/structure.rst:
##########
@@ -17,16 +17,60 @@
under the License.
-Structure of OpenLineage Airflow integration
+OpenLineage Airflow integration
--------------------------------------------
-OpenLineage integration implements AirflowPlugin. This allows it to be
discovered on Airflow start and
-register Airflow Listener.
+OpenLineage is an open framework for data lineage collection and analysis.
+At its core is an extensible specification that systems can use to
interoperate with lineage metadata.
+`Check out OpenLineage docs <https://openlineage.io/docs/>`_.
-The listener is then called when certain events happen in Airflow - when DAGs
or TaskInstances start, complete or fail.
-For DAGs, the listener runs in Airflow Scheduler.
-For TaskInstances, the listener runs on Airflow Worker.
+Quickstart
+==========
+
+To instrument your Airflow instance with OpenLineage, see
:ref:`guides/user:openlineage`.
+
+To implement OpenLineage support for Airflow Operators, see
:ref:`guides/developer:openlineage`.
+
+What's in it for me ?
+=====================
+
+The metadata collected can answer questions like:
+
+- Why did specific data transformation fail?
+- What are the upstream sources feeding into certain dataset?
+- What downstream processes rely on this specific dataset?
+- Is my data fresh?
+- Can I identify the bottleneck in my data processing pipeline?
+- How did the latest code change affect data processing times?
+- How can I trace the cause of data inaccuracies in my report?
+- How are data privacy and compliance requirements being managed through the
data's lifecycle?
+- Are there redundant data processes that can be optimized or removed?
+- What data dependencies exist for this critical report?
+
+Understanding complex inter-DAG dependencies and providing up-to-date runtime
visibility into DAG execution can be challenging.
+OpenLineage integrates with Airflow to collect DAG lineage metadata so that
inter-DAG dependencies are easily maintained
+and viewable via a lineage graph, while also keeping a catalog of historical
runs of DAGs.
+
+.. image::
https://openlineage.io/assets/images/af-schematic-ad8c295a182cb32b94ee27b96727fa98.svg
+ :alt: airflow_lineage
+ :width: 1792
+
+For OpenLineage backend that will receive events, you can use `Marquez
<https://marquezproject.ai/>`_
+
+.. image:: https://marquezproject.ai/img/screenshot.png
Review Comment:
I'm not sure that using external URLs for images is the correct approach. I
didn't find any other such example in the current Airflow docs.
##########
docs/apache-airflow-providers-openlineage/guides/structure.rst:
##########
@@ -17,16 +17,60 @@
under the License.
-Structure of OpenLineage Airflow integration
+OpenLineage Airflow integration
--------------------------------------------
-OpenLineage integration implements AirflowPlugin. This allows it to be
discovered on Airflow start and
-register Airflow Listener.
+OpenLineage is an open framework for data lineage collection and analysis.
+At its core is an extensible specification that systems can use to
interoperate with lineage metadata.
+`Check out OpenLineage docs <https://openlineage.io/docs/>`_.
-The listener is then called when certain events happen in Airflow - when DAGs
or TaskInstances start, complete or fail.
-For DAGs, the listener runs in Airflow Scheduler.
-For TaskInstances, the listener runs on Airflow Worker.
+Quickstart
+==========
+
+To instrument your Airflow instance with OpenLineage, see
:ref:`guides/user:openlineage`.
+
+To implement OpenLineage support for Airflow Operators, see
:ref:`guides/developer:openlineage`.
+
+What's in it for me ?
+=====================
+
+The metadata collected can answer questions like:
+
+- Why did specific data transformation fail?
+- What are the upstream sources feeding into certain dataset?
+- What downstream processes rely on this specific dataset?
+- Is my data fresh?
+- Can I identify the bottleneck in my data processing pipeline?
+- How did the latest code change affect data processing times?
+- How can I trace the cause of data inaccuracies in my report?
+- How are data privacy and compliance requirements being managed through the
data's lifecycle?
+- Are there redundant data processes that can be optimized or removed?
+- What data dependencies exist for this critical report?
+
+Understanding complex inter-DAG dependencies and providing up-to-date runtime
visibility into DAG execution can be challenging.
+OpenLineage integrates with Airflow to collect DAG lineage metadata so that
inter-DAG dependencies are easily maintained
+and viewable via a lineage graph, while also keeping a catalog of historical
runs of DAGs.
+
+.. image::
https://openlineage.io/assets/images/af-schematic-ad8c295a182cb32b94ee27b96727fa98.svg
Review Comment:
I'm not sure that using external URLs for images is the correct approach. I
didn't find any other such example in the current Airflow docs.
##########
docs/apache-airflow-providers-openlineage/guides/structure.rst:
##########
@@ -17,16 +17,60 @@
under the License.
-Structure of OpenLineage Airflow integration
+OpenLineage Airflow integration
--------------------------------------------
-OpenLineage integration implements AirflowPlugin. This allows it to be
discovered on Airflow start and
-register Airflow Listener.
+OpenLineage is an open framework for data lineage collection and analysis.
+At its core is an extensible specification that systems can use to
interoperate with lineage metadata.
Review Comment:
```suggestion
At its core it is an extensible specification that systems can use to
interoperate with lineage metadata.
```
##########
airflow/providers/openlineage/provider.yaml:
##########
@@ -58,65 +58,67 @@ config:
openlineage:
description: |
This section applies settings for OpenLineage integration.
- For backwards compatibility with `openlineage-python` one can still use
- `openlineage.yml` file or `OPENLINEAGE_` environment variables. However,
below
- configuration takes precedence over those.
- More in documentation -
https://openlineage.io/docs/client/python#configuration.
+ More about configuration and it's precedence can be found at
+
https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html#transport-setup
options:
disabled:
description: |
- Set this to true if you don't want OpenLineage to emit events.
+ Disable sending events without uninstalling the OpenLineage Provider
by setting this to true.
type: boolean
example: ~
default: "False"
version_added: ~
disabled_for_operators:
description: |
- Semicolon separated string of Airflow Operator names to disable
+ Exclude some Operators from emitting OpenLineage events by passing a
string of semicolon separated
+ full import paths of Operators to disable.
type: string
example:
"airflow.operators.bash.BashOperator;airflow.operators.python.PythonOperator"
default: ""
version_added: 1.1.0
namespace:
description: |
- OpenLineage namespace
+ Set namespace that the lineage data belongs to, so that if you use
multiple OpenLineage producers,
+ events coming from them will be logically separated.
version_added: ~
type: string
- example: "food_delivery"
+ example: "my_airflow_instance_1"
default: ~
extractors:
description: |
- Semicolon separated paths to custom OpenLineage extractors.
+ Register custom OpenLineage Extractors by passing a string of
semicolon separated full import paths.
type: string
example: full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass
default: ~
version_added: ~
config_path:
description: |
- Path to YAML config. This provides backwards compatibility to pass
config as
+ Provide path to YAML config file. This provides backwards
compatibility to pass config as
`openlineage.yml` file.
version_added: ~
type: string
- example: ~
+ example: "full/path/to/openlineage.yml"
default: ""
transport:
description: |
- OpenLineage Client transport configuration. It should contain type
- and additional options per each type.
+ Pass OpenLineage Client transport configuration as JSON string. It
should contain type of the
+ transport and additional options (different for each transport
type). For more details see:
+ https://openlineage.io/docs/client/python/#built-in-transport-types
Currently supported types are:
* HTTP
* Kafka
* Console
+ * File
type: string
- example: '{"type": "http", "url": "http://localhost:5000"}'
+ example: '{"type": "http", "url": "http://localhost:5000", "endpoint":
"api/v1/lineage"}'
default: ""
version_added: ~
disable_source_code:
description: |
- If disabled, OpenLineage events do not contain source code of
particular
- operators, like PythonOperator.
+ Disable including source code in OpenLineage events by setting this
to true. Several Operators (f.e.
+ Python, Bash) will by default include their source code in their
OpenLineage events if not disabled.
Review Comment:
```suggestion
Disable the inclusion of source code in OpenLineage events by
setting this to `true`.
By default, several Operators (e.g. Python, Bash) will include
their source code in the events
unless this is disabled.
```
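Related to the `transport` option in the hunk above, which is described as
a JSON string containing the transport type plus type-specific options: a
minimal sketch of validating such a value before use. `parse_transport` is a
hypothetical helper, and the check for a `type` key is an assumption based
only on the option's description, not on the provider's implementation.

```python
import json


def parse_transport(raw: str) -> dict:
    """Parse a transport config given as a JSON string.

    Hypothetical helper: returns an empty dict for the default empty string,
    and requires a JSON object with a "type" key otherwise.
    """
    if not raw:
        return {}
    config = json.loads(raw)
    if not isinstance(config, dict) or "type" not in config:
        raise ValueError("transport config must be a JSON object with a 'type' key")
    return config


# Example value taken from the `transport` option in the diff.
transport = parse_transport(
    '{"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}'
)
```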
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.