Re: [DISCUSS] Reworking dag run and task OTEL spans

Daniel Standish Tue, 17 Feb 2026 16:52:33 -0800

Thanks, Christos.

Please take a look at the screenshots in my PR. You can see what it looks
like.

I am able to use the parent dag run and task spans even down into the Task
SDK.

The only thing you don’t get is a green bar that lasts the whole time, but
to me this doesn’t really matter.

Please take a look when you have a chance. I think it’s a good approach
with a very small trade-off.
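
To make the trade-off concrete, here is a minimal sketch of short spans that
keep their parent-child links. It is a toy model in plain Python, not the code
in the PR and not the OpenTelemetry API: the `Span` class and the `start_span`
and `to_carrier` helpers are hypothetical names used only for illustration.
The one real convention it borrows is the W3C `traceparent` string layout
(version, trace id, parent span id, flags).

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span record (hypothetical; NOT the OpenTelemetry SDK class)."""

    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.time()


def start_span(name: str, carrier: Optional[dict] = None) -> Span:
    """Start a span, parented to the span described by `carrier` if given."""
    if carrier is None:
        # Root span: fresh 16-byte trace id and no parent.
        trace_id, parent_id = secrets.token_hex(16), None
    else:
        # W3C traceparent layout: version-traceid-spanid-flags
        _version, trace_id, parent_id, _flags = carrier["traceparent"].split("-")
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


def to_carrier(span: Span) -> dict:
    """Serialize the ids a child needs; this dict can cross process boundaries."""
    return {"traceparent": f"00-{span.trace_id}-{span.span_id}-01"}


# Scheduler side: emit a short dag run span and keep only its carrier.
dag_run = start_span("dag_run")
dag_run_carrier = to_carrier(dag_run)
dag_run.finish()  # closed immediately; no long-lived span object is kept

# Later, possibly in another process: the task span is parented via the carrier.
task = start_span("task_2", dag_run_carrier)
task_carrier = to_carrier(task)

# Inside the task: a sub-span parented to the task span.
sub = start_span("task_2_1", task_carrier)
sub.finish()
task.finish()
```

In this model the dag run span is closed right away, yet the task span and its
sub-span still share its trace id and point at the right parents, because only
the small carrier dict has to travel between processes. What is lost is the
dag run bar spanning the whole run in the trace view.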

On Tue, Feb 17, 2026 at 9:37 AM Christos Bisias <[email protected]>
wrote:

> Hi Daniel,
>
> I was the author of the current OTel spans approach and I would like to
> share some context.
>
> The spans for the dag_run and the tasks aren't adding much value because
> all the info is already available in the web server. The main benefit is
> the ability to add sub-spans under tasks and monitor individual steps
> or external operations. That's why it was designed this way.
>
> You need the task span to stay active long enough to get its context and
> propagate it under the task. After that you can use it to create sub-spans.
> If the parent span isn't active while you are creating the sub-span, the
> visual result won't make much sense.
>
> Initially, I had an idea to create a short span at the beginning of the
> task and then a short span at the end, but it doesn't look good when there
> are multiple sub-spans, and it's hard to figure out the order of the
> steps and their placement in the dag_run.
>
> *Possible tree*
> dag_run    [------------------------------------------]
> task_1        [-----]
> task_2                [----------------------------------]
> task_2_1                [--------]
> task_2_2                           [----]
> task_2_3                                 [----------------]
>
> The implementation ended up being quite complex due to scheduler HA. The
> span objects can't be shared outside of the process that created them; they
> are thread-local. One dag_run can be so long-running that the scheduler
> that started processing it isn't the one that marks it as finished. The
> benefit of having an active span for each scheduler that keeps track of a
> dag_run is the ability to get a clear picture of individual task
> steps and get continuous observability.
>
> I'm not trying to say that it shouldn't be simplified if possible.
> When I contributed these changes, Airflow looked very different from what
> it is now. For example, the task-sdk didn't exist and tasks had direct DB
> access. If, for some reason, the task span ended and a new span was started,
> you could get the context_carrier directly from the DB and use it to make
> the new span the parent of any future sub-spans. That's not possible
> anymore.
>
> I'm open to all ideas. I just wanted to explain the complexity.
>
> Christos
>
> On Tue, Feb 17, 2026 at 6:53 PM Daniel Standish <[email protected]>
> wrote:
>
> > I should add a bit of detail....
> >
> > What we currently do is create a span when the scheduler realizes a dag
> > run is running. And then we store it in a dictionary. Then when we
> > detect the dag run ends we check in the dictionary and close it if we can.
> >
> > Similar for tasks.
> >
> > But these spans have to deal with scheduler restarts, different
> > schedulers handling the same dag run and tasks, etc.
> >
> > I don't think it's really the way you're supposed to create spans. And
> > I'm not one to say we always have to do things the "right" way. But I
> > don't like the complexity it introduces and I don't see the benefit.
> >
> > So in the PR I rip out all of those "active spans" dictionaries and just
> > create spans when it makes sense and maintain the parent-child
> > relationships so you can still see the full flow in the end.
> >
> > On Tue, Feb 17, 2026 at 8:45 AM Daniel Standish <[email protected]>
> > wrote:
> >
> > > Hi, I am looking at our OTEL stuff and I have reached the conclusion
> > > that we should rework it so we don't jump through hoops to keep alive
> > > very long spans for dag run and task.
> > >
> > > We should still have the spans, we just shouldn't jump through hoops to
> > > ensure that their start and end times match those in the metastore.
> > >
> > > Indeed, that's information that's always already available!
> > >
> > > I do not claim to be an OTEL expert.
> > >
> > > But intuitively we can see that the current approach is very
> > > complicated and confusing, and therefore probably less reliable and
> > > certainly less maintainable.
> > >
> > > You can see what I've done so far here:
> > > https://github.com/apache/airflow/pull/61897
> > >
> > > Sycophantic though it may be, ChatGPT seems to agree with me:
> > > https://chatgpt.com/share/698fea02-fa18-8013-93f7-1e26215bc3f6
> > >
> > > And it makes sense -- just make spans that will automatically be closed
> > > when your specific action is over.
> > >
> > > WDYT?
> > >
> > > Thanks
> > >
> >
>
