aditanase commented on issue #9415:
URL: https://github.com/apache/datafusion/issues/9415#issuecomment-2133800229

   Hi all, thanks for adding this and for investigating the tracing crate. I'd like to suggest being a bit more specific about the goals of adding tracing before jumping in with both feet :). Maybe I can pitch in some use cases to help with this.
   
   My team is prototyping a distributed engine on top of ballista. Since ballista doesn't yet have a great UI, we started looking at adding some end-to-end tracing (think external client -> Flight SQL query -> scheduler -> enqueue job -> executors -> DF engine). Once we realised there is currently no tracing in either project, we quickly found this issue.
   
   I think the tracing crate, together with some of the community-provided subscribers (e.g. the opentelemetry stack), can solve this problem, even though there are a number of challenges:
   - correctly instrumenting streams and async blocks
   - using `or_current` consistently, so that changing the log level between trace, debug and info doesn't throw away good information (see the first sketch after this list)
   - tracing across service boundaries (for ballista especially, but also for any app embedding datafusion; see the second sketch after this list)
   - tracing pub/sub patterns (in ballista we have a job queue through which the scheduler and workers are decoupled)
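   To make the first two bullets concrete, here is a minimal sketch (span and field names are hypothetical, not existing DataFusion code) of instrumenting an async block while keeping the work attached to the enclosing span when the debug level is filtered out:

```rust
use tracing::Instrument;

async fn execute_partition(partition: usize) {
    // Hypothetical span. `or_current()` means that if debug-level spans are
    // disabled by the filter, the async block stays attached to the current
    // (enclosing) span instead of being orphaned from the trace.
    let span = tracing::debug_span!("execute_partition", partition);
    async move {
        // ... drive the partition's record batch stream here ...
        tracing::info!("partition finished");
    }
    .instrument(span.or_current())
    .await;
}
```

   Streams need the same treatment, with the span re-entered on every poll, which is what the instrumented wrappers in the `tracing-futures` crate do.
   For the service-boundary and pub/sub bullets, the usual approach is to serialise the span context into whatever carrier crosses the boundary (gRPC metadata, Flight headers, the job queue payload) and re-attach it on the other side. A rough sketch, assuming the `opentelemetry` and `tracing-opentelemetry` crates (with the OpenTelemetry layer installed) and a plain map as the carrier; all names are hypothetical:

```rust
use std::collections::HashMap;

use opentelemetry::global;
use tracing_opentelemetry::OpenTelemetrySpanExt;

// Scheduler side (hypothetical): serialise the current span's context into a
// map of headers that travels with the job. A concrete propagator (e.g. the
// W3C TraceContextPropagator) must be registered via
// global::set_text_map_propagator for this to do anything.
fn inject_trace_context(headers: &mut HashMap<String, String>) {
    let cx = tracing::Span::current().context();
    global::get_text_map_propagator(|propagator| propagator.inject_context(&cx, headers));
}

// Executor side (hypothetical): re-attach the propagated context as the parent
// of the local span, so the work shows up under the scheduler's trace.
fn attach_trace_context(headers: &HashMap<String, String>, span: &tracing::Span) {
    let parent_cx = global::get_text_map_propagator(|propagator| propagator.extract(headers));
    span.set_parent(parent_cx);
}
```

   The same pattern covers the job queue: store the serialised context alongside the job and extract it when a worker picks the job up.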
   
   To that end, I'd like to understand whether this issue is really about reimplementing metrics on top of tracing, or just about consolidating some of the timing / metrics bookkeeping.
   
   In my experience with other systems (mostly on the JVM, building and tuning Spark / Kafka deployments), tracing and metrics work really well together, but they are rarely conflated.
   This is especially true for the core internal metrics used by the query engine itself (e.g. `EXPLAIN ANALYZE`, cost-based optimizations, building a nice UI for the scheduler). Tracing, by contrast, is typically sampled (tracing everything at a certain level becomes expensive quickly), involves a lot of user input (extra context, combined with app metrics), and is configurable through the log level.
   
   My suggestion would be to decouple adding tracing (as a tool for people who are monitoring / optimizing engines built on top of DF) from the core metrics refactoring.
   For metrics specifically there are [other crates](https://docs.rs/metrics/latest/metrics/) with more targeted concepts (counters, gauges, histograms) that have some integration with tracing, in order to propagate the current span context as metrics labels.
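   For illustration, recording a counter and a histogram with that crate could look roughly like this (metric names are made up, this assumes the newer handle-based macro style, and a recorder/exporter still has to be installed for the values to go anywhere):

```rust
use metrics::{counter, histogram};

// Hypothetical metric names, not something DataFusion exposes today.
fn record_partition_stats(rows: u64, elapsed_ms: f64) {
    // Monotonic counter of rows produced by an operator.
    counter!("datafusion.rows_produced").increment(rows);
    // Distribution of per-partition execution time.
    histogram!("datafusion.partition_elapsed_ms").record(elapsed_ms);
}
```

   The span-context-as-labels part is what the `metrics-tracing-context` crate provides on top of this.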
   
   Lastly, if not much work has started here yet: I've already been playing around with some of the suggestions in this thread (adding `instrument` to `execute`, instrumenting streams and async blocks, etc.) and I'd be interested in contributing to this track, especially the lessons learned around tracing async code and streams.
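   The `execute` experiment is essentially the attribute form of the same idea; a minimal, self-contained sketch (the function is a stand-in, not the real `ExecutionPlan::execute`, whose returned stream would still need separate instrumentation):

```rust
use tracing::instrument;

// Stand-in for an execute-style method; arguments that are not skipped are
// recorded as span fields, so `partition` shows up on the span automatically.
#[instrument(level = "debug", skip(batches))]
fn scan_partition(partition: usize, batches: Vec<u64>) -> u64 {
    let total = batches.iter().sum();
    tracing::debug!(total, "partition scanned");
    total
}

fn main() {
    // Simple subscriber so the debug event (and its enclosing span) is printed.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();
    scan_partition(0, vec![1, 2, 3]);
}
```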

