I've created ARROW-12671 [1] to track this work and filed a draft PR [2]; I'd appreciate any feedback, particularly from anyone already trying to use OpenTelemetry/Tracing/Census with Arrow.
For dependencies: we now use OpenTelemetry header-only by default. I also slimmed down the build, so it no longer waits on OpenTelemetry. By setting a CMake flag, you can link Arrow against OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that can be toggled via environment variable.

For Python: the PR includes basic integration with Flight/Python. The C++ side will start a span, then propagate it to Python. Spans in Python will not propagate back to C++, and Python and C++ each need to set up their own exporter. I plan to ask the upstream community whether there's a good solution to this kind of issue.

For ABI compatibility: this will be an issue until upstream reaches 1.0. Even now, there's an unreleased change on their main branch which will break the current PR when it's released. Hopefully they will reach 1.0 during the Arrow 5.0 release cycle; otherwise, we probably want to avoid shipping this until there is a 1.0. I have confirmed that linking an application which itself links OpenTelemetry to Arrow works.

As for the overhead: I measured the impact on a dataset scan recording ~900 spans per iteration and there was no discernible effect on runtime compared to an uninstrumented scan (though again, this is not that many spans).

Best,
David

[1]: https://issues.apache.org/jira/browse/ARROW-12671
[2]: https://github.com/apache/arrow/pull/10260

On 2021/05/01 19:53:45, "David Li" <lidav...@apache.org> wrote:
> Thanks everyone for all the comments. Responding to a few things:
>
> > It seems to me it would be fairly implementation dependent -- so each
> > language implementation would choose if it made sense for them and then
> > implement the appropriate connection to that language's open telemetry
> > ecosystem.
>
> Agreed - I think the important thing is to agree on using OpenTelemetry
> itself so that the various Flight implementations, for instance, can all
> contribute compatible trace data.
> And there will be details like naming of
> keys for extra metadata we might want to attach, or trying to make (some)
> span names consistent.
>
> > My main question is: does integrating OpenTracing complicate our build
> > procedure? Is it header-only as long as you use the no-op tracer? Or
> > do you have to build it and link with it nonetheless?
>
> I need to look into this more and will follow up. I believe we can use it
> header-only. It's fairly simple to depend on (and has no required
> dependencies), but it is a synchronous build step (you must build it to have
> its headers available) - perhaps that could be resolved upstream or I am
> configuring CMake wrongly. Right now, I've linked in OpenTelemetry to provide
> a few utilities (e.g. logging data to stdout as JSON), but that could be
> split out into a libarrow_tracing.so if we keep them.
>
> > Also, are there ABI issues that may complicate integration into
> > applications that were compiled against another version of OpenTracing?
>
> Upstream already seems to be considering ABI compatibility. However, until
> they reach 1.0, of course they need not keep any promises, and that is a
> worry depending on their timeline. As pointed out already, they are moving
> quickly, but they are behind the other languages' OpenTelemetry
> implementations.
>
> > I'm not sure what the overhead is when disabled--I think it is probably
> > minimal or else it wouldn't be used so widely. But if we're not ready to
> > jump right in, we could introduce our own @WithSpan annotation which by
> > default is a no-op. To build an instrumented Arrow lib, you'd hook it up
> > with a shim.
>
> I am focusing on C++ here but of course the other languages come into play. A
> similar idea for C++ may be useful if we need to have OpenTelemetry be
> optional to avoid ABI worries. A branch may also work, but I'd like to avoid
> that if possible.
>
> Best,
> David
>
> On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > I agree that OpenTelemetry is the future; I have been following the
> > observability space off and on and I knew about OpenTracing; I just
> > realized that OpenTelemetry is its successor. [1]
> > I have found tracing to be a very powerful approach; at one point, I did a
> > POC of a trace recorder inside a Java webapp, which shed light on some
> > nasty bottlenecks. If integrated properly, it can be left on all the time,
> > so it's valuable for doing root-cause analysis in production. At least in
> > Java, there are already a lot of packages with OpenTelemetry hooks built
> > in. [2]
> > I'm not sure what the overhead is when disabled--I think it is probably
> > minimal or else it wouldn't be used so widely. But if we're not ready to
> > jump right in, we could introduce our own @WithSpan annotation which by
> > default is a no-op. To build an instrumented Arrow lib, you'd hook it up
> > with a shim. Or you could just maintain a branch with instrumentation for
> > people to try it out.
> >
> > [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> > [2] https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> >
> > On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote:
> > > Dear David,
> > >
> > > OpenTelemetry tracing is definitely the future, I guess the question is
> > > how far down the stack we want to put it. I think it would be useful
> > > for flight and other higher level modules, and for DataFusion for example
> > > it would be really useful.
> > > As for being alpha, I don’t think it will stay that way very long, there
> > > is a ton of industry momentum behind OpenTelemetry.
> > >
> > > -Evan
> > >
> > > > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > For Arrow Datasets, I've been working to instrument the scanner to find
> > > > bottlenecks. For example, here's a demo comparing the current async
> > > > scanner, which doesn't truly read asynchronously, to one that does; it
> > > > should be fairly evident where the bottleneck is:
> > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > >
> > > > I'd like to upstream this, but I'd like to run some questions by
> > > > everyone first:
> > > > - Does this look useful to developers working on other sub-projects?
> > > > - This uses OpenTelemetry[1], which is still in alpha, so are we
> > > >   comfortable with adopting it? Is the overhead acceptable?
> > > > - Is there anyone using Arrow to build services, that would find more
> > > >   general integration useful?
> > > >
> > > > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > > > for operations like reading a single record batch. The data is saved as
> > > > JSON, then rendered by some JavaScript. The branch is at [2].
> > > >
> > > > As a quick summary, OpenTelemetry implements distributed tracing, in
> > > > which a request is tracked as a directed acyclic graph of spans. A span
> > > > is just metadata (name, ID, start/end time, parent span, ...) about an
> > > > operation (function call, network request, ...). Typically, it's used in
> > > > services. Spans can reference each other across machines, so you can
> > > > track a request across multiple services (e.g. finding which service
> > > > failed/is unusually slow in a chain of services that call each other).
> > > >
> > > > As opposed to a (sampling) profiler, this gives you application-level
> > > > metadata, like filenames or S3 download rates, that you can use in
> > > > analysis (as in the demo). It's also something you'd always keep turned
> > > > on (at least when running a service). If integrated with Flight,
> > > > OpenTelemetry would also give us a performance picture across multiple
> > > > machines - speculatively, something like making a request to a Flight
> > > > service and being able to trace all the requests it makes to S3.
> > > >
> > > > It does have some overhead; you wouldn't annotate every function in a
> > > > codebase. This is rather anecdotal, but for the demo above, there was
> > > > essentially zero impact on runtime. Of course, that demo records very
> > > > little data overall, so it's not very representative.
> > > >
> > > > Alternatives:
> > > > - Add a simple Span class of our own, and defer Flight until later.
> > > > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> > > >   enabled at build time. This would be messier but should alleviate any
> > > >   performance questions.
> > > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> > > >   caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> > > >   multi-machine use case, but would otherwise work. I haven't looked
> > > >   into these much, but could evaluate them, especially if they seem more
> > > >   fit for purpose for use in other Arrow subprojects.
> > > >
> > > > If people aren't super enthused, I'll most likely go with adding a
> > > > custom Span class for Datasets, and defer the question of whether we
> > > > should integrate Flight/Datasets with OpenTelemetry until another use
> > > > case arises. But recently we have seen interest in this - so I see this
> > > > as perhaps a chance to take care of two problems at once.
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > > [1]: https://opentelemetry.io/
> > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > [3]: https://perfetto.dev/
> > > > [4]: https://llvm.org/docs/XRay.html
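
For anyone skimming the thread, the span model discussed here (name, ID, start/end time, parent span, JSON output, a no-op path when disabled) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions: it is not the code in the PR, it does not use the OpenTelemetry API, and the names `span`, `tracing_enabled` and the `DEMO_TRACING` environment variable are hypothetical.

```python
import json
import os
import sys
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical toggle for this sketch; not a real Arrow or OpenTelemetry setting.
ENV_FLAG = "DEMO_TRACING"

# ID of the span that is currently open in this context (None at the root).
_current_span = ContextVar("current_span", default=None)


def tracing_enabled():
    """Tracing is on only when the (hypothetical) env var is set to "1"."""
    return os.environ.get(ENV_FLAG) == "1"


@contextmanager
def span(name, out=None, **attributes):
    """Record one operation as a span; does nothing when tracing is disabled."""
    if not tracing_enabled():
        yield None
        return
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": _current_span.get(),  # links child spans to their parent
        "attributes": attributes,          # application-level metadata
        "start_ns": time.time_ns(),
    }
    token = _current_span.set(record["span_id"])
    try:
        yield record
    finally:
        _current_span.reset(token)
        record["end_ns"] = time.time_ns()
        # Emit one JSON object per finished span (stderr by default).
        stream = out if out is not None else sys.stderr
        print(json.dumps(record), file=stream)


if __name__ == "__main__":
    os.environ[ENV_FLAG] = "1"
    with span("scan", path="data.parquet"):
        with span("read_batch", index=0):
            pass
```

Each finished span is written as one JSON line, and child spans finish (and are written) before their parents. A real integration would also need to propagate the parent ID across process and language boundaries, which is the part this sketch leaves out.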