xBis7 opened a new pull request, #43941:
URL: https://github.com/apache/airflow/pull/43941
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!--
Thank you for contributing! Please make sure that your code changes
are covered with tests. And in case of new features or big changes
remember to adjust the documentation.
Feel free to ping committers for the review!
In case of an existing issue, reference it using one of the following:
closes: #ISSUE
related: #ISSUE
How to write a good git commit message:
http://chris.beams.io/posts/git-commit/
-->
This PR is not a work in progress but I've set it as a draft because this is
my first time contributing to Airflow and I'm not sure if there should be an
AIP or a mailing discussion for such changes. I'd appreciate any feedback.
## Issue description
related: https://github.com/apache/airflow/pull/40802
There are some OpenTelemetry standard practices that help keep the usage
consistent across multiple projects. According to those
* when a root span starts, the trace id and the span id are generated
randomly
* while the span is active, the context is captured
* injected into a carrier object which is a map or a python dictionary
* the carrier with the captured context is propagated across services and
used to create sub spans
* the new sub-span extracts the parent context from the carrier and uses it
to set the parent
* the trace id is accessed from the carrier and the span id is generated
randomly
To explain more, this is what the flow of operations should be
1. Dag run starts - start `dagrun` span
i. Get the `dagrun` span context
2. Start task - with the `dagrun` context, start `ti` span
i. Get the `ti` span context
4. With the `ti` span context create task-sub span
5. Task finishes - `ti` span ends
6. Dag run finishes - `dagrun` span ends
Airflow follows a different approach
* trace and span ids are generated in a deterministic way from various
properties of the `dag_run` and the `task_instance`
* the spans for a dag and its tasks, are generated after the run has finished
This is the flow of the current implementation
1. Dag run starts
2. Tasks start
3. Tasks finish
4. Create spans for the tasks
5. Dag run finishes
6. Create span for the dag run
The current approach makes it impossible to create spans from under tasks
while using the existing airflow code. To achieve that, you need to use
https://github.com/howardyoo/airflow_otel_provider which has to generate the
same trace id and span id as airflow otherwise the spans won't be properly
associated. This isn't easily maintainable and it also makes it hard for people
that are familiar with otel but new to airflow, to start using the feature.
These are some references to OpenTelemetry docs
https://opentelemetry.io/docs/concepts/context-propagation/
https://opentelemetry.io/docs/languages/python/propagation/
https://www.w3.org/TR/trace-context/
## Implementation description
A lot of people might already be using airflow with the existing otel
implementation. To avoid any inconvenience, the changes are hidden behind a
config flag.
This patch is extending the existing implementation and not changing it.
Once the flag is turned on, a new set of spans gets generated and exported. The
new spans have the suffix `_ctx_prop` on their name.

I've reused the attributes from the original implementation. In addition,
the timings are the same.
* without context propagation

* with context propagation

To be able to propagate the context of a span, the span must be active.
For example, for a dag run, the span can't be created at the end but
* it needs to start with the run
* stay active so that the context can be captured and propagated
* end once the run also ends
Same goes for a task and any sub-spans.
With this approach, we can use the new otel methods for creating spans
directly from under a task without the need of the `airflow_otel_provider`.
These spans will be children of the task span.
Check `test_otel_dag.py` for an example of usage.
## Testing
I've added a unit test for the `otel_tracer` methods and an integration test
that runs a dag and then asserts the parent child relationships for the dag
span, the task spans and the task sub spans.
I've also tested the changes manually with a `PythonVirtualenvOperator`
without issues.
<!-- Please keep an empty line above the dashes. -->
---
**^ Add meaningful description above**
Read the **[Pull Request
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
for more information.
In case of fundamental code changes, an Airflow Improvement Proposal
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
is needed.
In case of a new dependency, check compliance with the [ASF 3rd Party
License Policy](https://www.apache.org/legal/resolved.html#category-x).
In case of backwards incompatible changes please leave a note in a
newsfragment file, named `{pr_number}.significant.rst` or
`{issue_number}.significant.rst`, in
[newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]