mjpieters commented on issue #12771:
URL: https://github.com/apache/airflow/issues/12771#issuecomment-753404477
Interesting little test; but note that it generates a trace from the tasks after
the fact, without support for additional spans created __inside__ each task
invocation, and the tracing context is not shared over any RPC / REST /
etc. calls to other services, so those services can't inherit the tracing context.
I integrated Jaeger tracing (opentracing) support into a production Airflow
setup using Celery. Specific challenges I had to overcome:
- The dagrun span is 'virtual', in that it exists from when the dagrun is
created until the last task is completed. The scheduler will then update the
dagrun state in the database. But tasks need a valid parent span to attach
their own spans to.
I solved this by creating a span in a monkey-patch to
`DAG.create_dagrun()`, injecting the span info into the dagrun configuration
together with the start time, then _discarding the span_. Then, in
`DagRun.set_state()`, when the state changes to a finished state, I create a
`jaeger_client.span.Span()` object from scratch using the dagrun conf-stored
data, and submit that to Jaeger.
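As a rough sketch of that data flow (the span and conf objects here are plain-dict stand-ins, not `jaeger_client`'s or Airflow's actual APIs, and the function names are illustrative, not the real patch):

```python
import time
import uuid


def create_dagrun_patched(conf):
    """Monkey-patch sketch: open a span at dagrun creation, stash its
    context plus the start time in the dagrun conf, then discard the
    live span object instead of finishing it."""
    span_ctx = {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex}
    conf["_trace"] = {**span_ctx, "start_time": time.time()}
    return conf  # the live span object itself is dropped here


def set_state_finished(conf):
    """When the dagrun reaches a finished state, rebuild a span from the
    conf-stored data and hand it to the reporter with the real end time,
    so its duration covers the whole dagrun lifetime."""
    trace = conf["_trace"]
    return {
        "trace_id": trace["trace_id"],
        "span_id": trace["span_id"],
        "start_time": trace["start_time"],
        "end_time": time.time(),
        # in the real patch: construct jaeger_client.span.Span() and
        # submit it to the Jaeger reporter here
    }
```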
- Tasks inherit the parent (dagrun) span context from the dagrun config; I
patched `Task.run_raw_task()` to run the actual code under a
`tracer.start_active_span()` context manager. This captures timing and any
exceptions.
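A minimal sketch of that task-side wrapper; the context manager is a stand-in for `tracer.start_active_span()`, and `run_raw_task_patched` / the `sink` list are hypothetical names, not Airflow's or opentracing's API:

```python
import contextlib
import time


@contextlib.contextmanager
def start_active_span(operation, parent_ctx, sink):
    """Stand-in for tracer.start_active_span(): records timing and
    whether the wrapped code raised, then reports the span."""
    span = {"op": operation, "parent": parent_ctx,
            "start": time.time(), "error": False}
    try:
        yield span
    except Exception:
        span["error"] = True  # the span captures the exception
        raise
    finally:
        span["end"] = time.time()
        sink.append(span)  # the span is reported even on failure


def run_raw_task_patched(task_fn, dagrun_conf, sink):
    """Run the actual task code under an active span whose parent
    context is the 'virtual' dagrun span stored in the dagrun conf."""
    parent_ctx = dagrun_conf.get("_trace")
    with start_active_span("task", parent_ctx, sink):
        return task_fn()
```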
- You need an active tracer for traces to be captured and sent on to the
tracer agent. So I registered code to run in the
`cli_action_loggers.register_pre_exec_callback()` hook when the `scheduler` or
`dag_trigger` sub-commands run, which then registers a closing callback with
`cli_action_loggers.register_post_exec_callback()`. Closing the tracer in
`dag_trigger` takes careful work with the asyncio / tornado loop used by the
Jaeger client; you'll lose traces if you don't watch out. I found that the only
fail-safe way to get the last traces out was to hunt down the I/O loop attached
to the trace reporter object and call `tracer.close()` from a callback sent to
that loop. I don't know if opentracing should need this level of awareness of
the implementation details.
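The shutdown dance might look roughly like this. The attribute path `tracer.reporter.io_loop` follows `jaeger_client`'s layout but is an assumption here; the loop itself is a thread-backed stand-in, since the real tornado loop isn't available in this sketch:

```python
import queue
import threading


class FakeLoop:
    """Stand-in for the I/O loop jaeger_client runs in a background
    thread; add_callback() schedules work onto that thread."""

    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            self._q.get()()

    def add_callback(self, fn):
        self._q.put(fn)


def close_tracer_on_its_loop(tracer, timeout=5.0):
    """Schedule tracer.close() as a callback on the reporter's own loop
    and block until it has run, so buffered spans are flushed before the
    CLI process exits. Returns False if the close never completed."""
    done = threading.Event()

    def _close():
        tracer.close()  # must run on the reporter's loop, not ours
        done.set()

    tracer.reporter.io_loop.add_callback(_close)
    return done.wait(timeout)
```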
But with that work in place, we now get traces mostly in real time, with
full support for tracing contexts being shared with other services called from
Airflow tasks. We can trace a job from the frontend, through its submission to
Airflow, and then follow any calls from tasks to further REST APIs, all as one
system.
I'd prefer it if the tracing context were not shoehorned into the dagrun
configuration; if I were doing this inside the Airflow project itself, I'd have
created additional database tables or columns for it in the Airflow database
model.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]