[PR] feat: automatically inject OL transport info into spark jobs [airflow]

via GitHub Wed, 01 Jan 2025 05:38:19 -0800


kacpermuda opened a new pull request, #45326:
URL: https://github.com/apache/airflow/pull/45326

Similar to #44477, this PR introduces a new feature to the OpenLineage
integration. **It will NOT impact users who are not using OpenLineage or have
not explicitly enabled this feature (it is disabled by default).**

## TLDR;
When explicitly enabled by the user for supported operators, we will
automatically inject transport information into the Spark job properties. For
example, when submitting a Spark job using the DataprocSubmitJobOperator, we
will configure the Spark OpenLineage integration to use the same transport
configuration that the Airflow integration uses.

## Why?

Currently, this process requires manual configuration by the user, as
described
[here](https://openlineage.io/docs/integrations/spark/configuration/airflow/).
E.g.:
```python
DataprocSubmitJobOperator(
    task_id="my_task",
    # ...
    job={
        # ...
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": openlineage_url,
        "spark.openlineage.transport.compression": "gzip",
        "spark.openlineage.transport.auth.apiKey": api_key,
        "spark.openlineage.transport.auth.type": "apiKey",
    }
)
```
Because we know how each supported Airflow operator passes configuration to
Spark, we can inject the transport information automatically.

## Controlling the Behavior

We provide users with a flexible control mechanism to manage this injection,
combining per-operator enablement with a global fallback configuration. This
design is inspired by the `deferrable` argument in Airflow.

```python
ol_inject_transport_info: bool = conf.getboolean(
    "openlineage", "spark_inject_transport_info", fallback=False
)
```
Each supported operator will include an argument like
`ol_inject_transport_info`, which defaults to the global configuration value of
`openlineage.spark_inject_transport_info`. This approach allows users to:

1. Control behavior on a per-job basis by explicitly setting the argument.
2. Rely on a consistent default configuration for all jobs if the argument
is not set.

This design ensures both flexibility and ease of use, enabling users to
fine-tune their workflows while minimizing repetitive configuration. I am aware
that adding an OpenLineage-related argument to the operator will affect all
users, even those not using OpenLineage, but since it defaults to False and can
be ignored, I hope this will not pose any issues.
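
To illustrate the per-operator argument with a global fallback, here is a minimal sketch. It is not the PR's code: `ExampleSparkOperator`, `_FAKE_CONFIG`, and the local `getboolean` helper are hypothetical stand-ins so the snippet runs without Airflow installed; in a real operator the default would come from `conf.getboolean` as shown above.

```python
# Hypothetical stand-in for Airflow's configuration, so the sketch is runnable
# on its own. A real operator would call conf.getboolean(...) instead.
_FAKE_CONFIG = {("openlineage", "spark_inject_transport_info"): "False"}


def getboolean(section: str, key: str, fallback: bool = False) -> bool:
    """Read a boolean option from the fake config, or return the fallback."""
    raw = _FAKE_CONFIG.get((section, key))
    if raw is None:
        return fallback
    return raw.strip().lower() in {"true", "t", "1"}


class ExampleSparkOperator:
    """Hypothetical operator showing only the argument handling."""

    def __init__(
        self,
        task_id: str,
        # The default is the global setting, read once when the class is defined.
        ol_inject_transport_info: bool = getboolean(
            "openlineage", "spark_inject_transport_info", fallback=False
        ),
    ) -> None:
        self.task_id = task_id
        self.ol_inject_transport_info = ol_inject_transport_info


# Falls back to the global default (False in this sketch).
default_task = ExampleSparkOperator(task_id="uses_global_default")

# An explicit per-task value overrides the global default.
opted_in_task = ExampleSparkOperator(
    task_id="opted_in", ol_inject_transport_info=True
)
```

Because the default is evaluated when the class is defined, a task that never mentions the argument simply follows the global configuration.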

## How?
The implementation is divided into three parts for better organization and
clarity:

1. **Operator's Code (including the `execute` method):**
   Contains minimal logic to avoid overwhelming users who are not actively
   working with OpenLineage.

2. **Google's Provider OpenLineage Utils File:**
   Handles the logic for accessing Spark properties specific to a given
   operator or job.

3. **OpenLineage Provider's Utils:**
   Responsible for creating / extracting all necessary information in a
   format compatible with the OpenLineage Spark integration. The Spark
   properties are also modified here.

For some operators, parts 1 and 2 may both live in the operator's code. In
general, the specific operator / provider knows how to get the Spark
properties, while the OpenLineage provider knows what to inject and performs
the injection itself.
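
The division of responsibilities described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: all function names are hypothetical, the job layout (`job["spark_job"]["properties"]`) is assumed, and letting properties the user already set take precedence over injected ones is a design choice made for this sketch only.

```python
from copy import deepcopy


# Part 3 (OpenLineage side, hypothetical helper): turn a nested transport
# config into flat "spark.openlineage.transport.*" properties.
def transport_to_spark_properties(transport: dict) -> dict:
    props: dict = {}

    def _flatten(prefix: str, value) -> None:
        if isinstance(value, dict):
            for key, nested in value.items():
                _flatten(f"{prefix}.{key}", nested)
        else:
            props[prefix] = str(value)

    _flatten("spark.openlineage.transport", transport)
    return props


# Part 2 (provider side, hypothetical helper): knows where this kind of job
# keeps its Spark properties. The layout used here is an assumption.
def get_spark_properties(job: dict) -> dict:
    return job.get("spark_job", {}).get("properties", {})


def inject_transport_info(job: dict, transport: dict) -> dict:
    """Return a copy of the job with transport properties added."""
    new_job = deepcopy(job)
    existing = get_spark_properties(new_job)
    # Assumption for this sketch: properties already set by the user win.
    merged = {**transport_to_spark_properties(transport), **existing}
    new_job.setdefault("spark_job", {})["properties"] = merged
    return new_job


# Part 1 (operator side): only a small, guarded call.
def execute_sketch(job: dict, transport: dict, ol_inject_transport_info: bool) -> dict:
    if ol_inject_transport_info:
        job = inject_transport_info(job, transport)
    return job


example_job = {"spark_job": {"properties": {"spark.executor.memory": "2g"}}}
example_transport = {
    "type": "http",
    "url": "http://localhost:5000",
    "auth": {"type": "apiKey", "apiKey": "secret"},
}
injected_job = execute_sketch(example_job, example_transport, True)
```

The operator-side function stays a single guarded call, which mirrors the goal of keeping the `execute` method minimal for users who do not work with OpenLineage.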

