aditanase commented on issue #9415: URL: https://github.com/apache/datafusion/issues/9415#issuecomment-2133800229
Hi all, thanks for adding this and investigating the `tracing` crate. I'd like to suggest being a bit more specific about the goals of adding tracing before jumping in with both feet :). Maybe I can pitch in some use cases to help with this.

My team is prototyping a distributed engine on top of ballista. Since ballista doesn't yet have a great UI, we started to look at adding some end-to-end tracing (think external client -> Flight SQL query -> scheduler -> enqueue job -> executors -> DF engine). As we realised there is currently no tracing in either project, we quickly found this issue.

I think the `tracing` crate, together with some of the community subscribers (e.g. the opentelemetry stack), can solve this problem, even though there are a number of challenges (sketched at the end of this comment):

- correctly instrumenting streams and async blocks
- using `or_current` consistently, to make sure that changing the log level between trace, debug and info doesn't throw away good information
- tracing across service boundaries (for ballista especially, but also for an app embedding datafusion)
- tracing pub/sub patterns (in ballista we have a job queue through which the scheduler and workers are decoupled)

To that end, I'd like to understand whether reimplementing metrics on top of tracing is really what this issue is about, or just an attempt at consolidating some of the timing / metrics bookkeeping. Based on my experience with other systems (mostly on the JVM, building and tuning Spark / Kafka deployments), tracing and metrics work really well together, but they are rarely conflated. That goes especially for core internal metrics used by the query engine (e.g. `EXPLAIN ANALYZE`, cost-based optimizations, building a nice UI for the scheduler), as opposed to tracing, which is typically done through sampling (tracing everything at a certain level becomes expensive quickly), carries a lot of user input (extra context, combined with app metrics) and is configurable through the log level.

My suggestion would be to decouple adding tracing (as a tool for people who are monitoring / optimizing engines built on top of DF) from the core metrics refactoring. For metrics specifically there are [other crates](https://docs.rs/metrics/latest/metrics/) with more targeted concepts (counters, gauges, histograms) that have some integration with tracing, in order to propagate the current span context as metric labels (see the last sketch below).

Lastly, if there is not a lot of work started here: I've already started to play around with some of the suggestions on this thread (adding `#[instrument]` to `execute`, instrumenting streams and async blocks, etc.), and I'd be interested in contributing to this track, especially some of the lessons learned around tracing async code and streams.
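To make the first two challenges concrete, here is a minimal sketch of the pattern I have in mind for instrumenting an async block and the stream it drives. The names (`run_partition`, `plan_partition`, the `partition` field) are hypothetical, not existing DataFusion APIs; the point is how `.instrument()` and `or_current()` compose:

```rust
use futures::StreamExt;
use tracing::Instrument;

// Hypothetical stand-in for an operator's per-partition work.
async fn run_partition(
    partition: usize,
    mut input: impl futures::Stream<Item = String> + Unpin,
) {
    // If "run_partition" is filtered out at the current level, `or_current()`
    // falls back to the caller's span instead of detaching this work from the
    // trace entirely.
    let span = tracing::debug_span!("run_partition", partition).or_current();

    async move {
        // Everything polled inside this block, including the input stream,
        // is recorded under `span`.
        while let Some(item) = input.next().await {
            tracing::trace!(%item, "processed item");
        }
        tracing::info!("partition finished");
    }
    .instrument(span)
    .await
}

// The attribute form is the lighter-weight option for plain functions;
// arguments are recorded as span fields by default.
#[tracing::instrument(level = "debug")]
fn plan_partition(partition: usize) {
    tracing::debug!("planning partition");
}
```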
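For the service-boundary case, the usual pattern (assuming the `opentelemetry` and `tracing-opentelemetry` crates, with an OpenTelemetry layer and a global text-map propagator already configured) is to inject the current span's context into request metadata on one side and re-attach it as the parent on the other. `HeaderMap`, `inject_current_context` and `span_with_remote_parent` below are made-up helper names, not ballista or DataFusion APIs:

```rust
use std::collections::HashMap;

use opentelemetry::global;
use opentelemetry::propagation::{Extractor, Injector};
use tracing_opentelemetry::OpenTelemetrySpanExt;

// Stand-in for gRPC / HTTP request metadata.
struct HeaderMap(HashMap<String, String>);

impl Injector for HeaderMap {
    fn set(&mut self, key: &str, value: String) {
        self.0.insert(key.to_owned(), value);
    }
}

impl Extractor for HeaderMap {
    fn get(&self, key: &str) -> Option<&str> {
        self.0.get(key).map(String::as_str)
    }
    fn keys(&self) -> Vec<&str> {
        self.0.keys().map(String::as_str).collect()
    }
}

// Client side (e.g. scheduler): serialize the current span's context into the
// outgoing request metadata.
fn inject_current_context(headers: &mut HeaderMap) {
    let cx = tracing::Span::current().context();
    global::get_text_map_propagator(|prop| prop.inject_context(&cx, headers));
}

// Server side (e.g. executor): extract the remote context and attach it as
// the parent of the span that handles the work.
fn span_with_remote_parent(headers: &HeaderMap) -> tracing::Span {
    let parent_cx = global::get_text_map_propagator(|prop| prop.extract(headers));
    let span = tracing::info_span!("handle_task");
    span.set_parent(parent_cx);
    span
}
```

The same inject/extract pair also covers the pub/sub case: the scheduler injects the context into the job payload when it enqueues, and the worker extracts it when it dequeues.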
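And on the metrics side, this is roughly what I mean by "more targeted concepts". A small sketch assuming the `metrics` facade crate with its handle-style macros (recent versions) and an exporter such as metrics-exporter-prometheus installed elsewhere; the metric names and the `stage` label are made up for illustration:

```rust
use std::time::Instant;

use metrics::{counter, histogram};

// Hypothetical per-stage bookkeeping; a trace or span id taken from the
// current tracing span could be attached as another label in the same way.
fn record_stage(stage: &str, run: impl FnOnce()) {
    let start = Instant::now();
    run();

    counter!("queries_total", "stage" => stage.to_owned()).increment(1);
    histogram!("query_duration_seconds", "stage" => stage.to_owned())
        .record(start.elapsed().as_secs_f64());
}
```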