mjpieters commented on issue #12771:
URL: https://github.com/apache/airflow/issues/12771#issuecomment-753404477


   Interesting little test; but note that it generates a trace from the tasks after
the fact, without support for additional spans created __inside__ each task
invocation, and the tracing context is not propagated over any RPC / REST /
etc. calls to other services, so those services can't inherit the tracing context.
   
   I integrated Jaeger tracing (opentracing) support into a production Airflow 
setup using Celery. Specific challenges I had to overcome:
   
   - The dagrun span is 'virtual', in that it exists from when the dagrun is 
created until the last task is completed. The scheduler will then update the 
dagrun state in the database. But tasks need a valid parent span to attach 
their own spans to.
   
     I solved this by creating a span in a monkey-patch to 
`DAG.create_dagrun()`, injecting the span info into the dagrun configuration
together with the start time, then _discarding the span_.  Then, in 
`DagRun.set_state()`, when the state changes to a finished state, I create a 
`jaeger_client.span.Span()` object from scratch using the dagrun conf-stored 
data, and submit that to Jaeger.
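
   The pattern above can be sketched roughly like this; `create_dagrun` and
`set_state` are simplified stand-ins for the monkey-patched Airflow methods,
the Jaeger reporter is replaced by a recording stub, and the ids are made up:

```python
import time

class StubReporter:
    """Stands in for the Jaeger reporter; just records finished spans."""
    def __init__(self):
        self.spans = []

    def report(self, span):
        self.spans.append(span)

def create_dagrun(conf):
    # Open a span, serialise just enough of its context into the dagrun
    # conf (trace id, span id, start time), then discard the span object.
    conf["trace"] = {
        "trace_id": 0xBEEF,  # hypothetical ids for the sketch
        "span_id": 0xCAFE,
        "start_time": time.time(),
    }
    return conf

def set_state(conf, state, reporter):
    # On transition to a finished state, rebuild a span from the context
    # stored in the conf and submit it, as if constructing a
    # jaeger_client span from scratch.
    if state in ("success", "failed"):
        trace = conf["trace"]
        span = {
            "trace_id": trace["trace_id"],
            "span_id": trace["span_id"],
            "start_time": trace["start_time"],
            "end_time": time.time(),
            "tags": {"dagrun.state": state},
        }
        reporter.report(span)

reporter = StubReporter()
conf = create_dagrun({})
set_state(conf, "success", reporter)
```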
   
   - Tasks inherit the parent (dagrun) span context from the dagrun config; I 
patched `Task.run_raw_task()` to run the actual code under a 
`tracer.start_active_span()` context manager. This captures timing and any 
exceptions.
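
   A rough sketch of that wrapper, assuming an opentracing-style tracer with a
`start_active_span()` context manager; the tracer here is a stub, and
`run_raw_task` is an illustrative stand-in for the patched Airflow method:

```python
import contextlib

class StubTracer:
    """Stands in for an opentracing tracer; records finished spans."""
    def __init__(self):
        self.finished = []

    @contextlib.contextmanager
    def start_active_span(self, name, child_of=None):
        span = {"name": name, "parent": child_of, "error": None}
        try:
            yield span
        except Exception as exc:
            # Exceptions raised by the task are captured on the span.
            span["error"] = repr(exc)
            raise
        finally:
            self.finished.append(span)

def run_raw_task(task_fn, conf, tracer):
    # Inherit the parent (dagrun) span context from the dagrun config,
    # then run the actual task code under an active span.
    parent_ctx = conf.get("trace")
    with tracer.start_active_span("task", child_of=parent_ctx):
        return task_fn()

tracer = StubTracer()
result = run_raw_task(lambda: 42, {"trace": {"trace_id": 1}}, tracer)
```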
   
   - You need an active tracer for traces to be captured and sent on to the
tracer agent. So I registered code to run in the
`cli_action_loggers.register_pre_exec_callback()` hook when the `scheduler` or
`dag_trigger` sub-commands run, which then registers a closing callback with
`cli_action_loggers.register_post_exec_callback()`. Closing a tracer in
`dag_trigger` takes careful work with the asyncio / tornado loop used by the
Jaeger client; you'll lose traces if you don't watch out. I found that the only
foolproof way to get the last traces out was to hunt down the I/O loop attached
to the trace reporter object and call `tracer.close()` from a callback sent to
that loop. I don't know if other opentracing implementations need this level of
awareness of the implementation details.
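
   The lifecycle hooks can be sketched like this, with Airflow's
`cli_action_loggers` registry and the tornado/asyncio loop owned by the Jaeger
reporter replaced by minimal stubs; all names are illustrative:

```python
class StubLoop:
    """Stands in for the tornado/asyncio loop owned by the Jaeger reporter."""
    def __init__(self):
        self.callbacks = []

    def add_callback(self, fn):
        self.callbacks.append(fn)

    def run(self):
        # Drain pending callbacks, as the real loop would before shutdown.
        while self.callbacks:
            self.callbacks.pop(0)()

class StubTracer:
    def __init__(self, loop):
        self.io_loop = loop
        self.closed = False

    def close(self):
        self.closed = True

# Stand-ins for the pre/post-exec callback registries.
pre_callbacks, post_callbacks = [], []
tracers = []

def on_pre_exec(sub_command=None, **_):
    # Only install a tracer for the sub-commands that need one.
    if sub_command in ("scheduler", "dag_trigger"):
        tracer = StubTracer(StubLoop())
        tracers.append(tracer)
        # Close from a callback on the reporter's own loop, so the last
        # traces are flushed before the process exits.
        post_callbacks.append(
            lambda: tracer.io_loop.add_callback(tracer.close)
        )

pre_callbacks.append(on_pre_exec)

# Simulate a `dag_trigger` run: pre hooks, work, post hooks, drain the loop.
for cb in pre_callbacks:
    cb(sub_command="dag_trigger")
for cb in post_callbacks:
    cb()
tracers[0].io_loop.run()
```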
   
   But, with that work in place, we now get traces in mostly real time, with
full support for tracing contexts being shared with other services called from
Airflow tasks. We can trace a job from the frontend, through its submission to
Airflow, and on to any calls from tasks to further REST APIs, all as one
system.
   
   I'd prefer it if the tracing context was not shoehorned into the dagrun 
configuration; I'd have created additional database tables or columns for this 
in the Airflow database model if I had to do this inside the Airflow project 
itself.
   

