FWIW, I tried this out yesterday since I was profiling the execution of the async API reader. It worked great, so +1 from me on that basis. I did struggle to find a good, simple visualization tool, though. Do you have any recommendations on that front?
On Mon, Jun 7, 2021 at 10:50 AM David Li <lidav...@apache.org> wrote:
>
> Just to give an update on where this stands:
>
> Upstream recently released v1.0.0-RC1 and I've updated the PR [1] to use
> it. This contains a few fixes I submitted for the platforms our various
> CI jobs use, as well as an explicit build flag to support header-only
> use - I think this should alleviate any concerns about it adding too
> much to our build. I'm hopeful this means it can make it into 5.0.0, at
> least with minimal functionality.
>
> For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> have a chance to look through the PR and see if there are any places
> where adding tracing may be useful.
>
> I also touched base with upstream about Python/C++ interop [2] - it
> turns out upstream has thought about this before but doesn't have the
> resources to pursue it at the moment, as the idea is to write an
> API-compatible binding of the C++ library for Python (and presumably R,
> Ruby, etc.), which is more work.
>
> Best,
> David
>
> [1]: https://github.com/apache/arrow/pull/10260
> [2]: https://github.com/open-telemetry/community/discussions/734
>
> On 2021/05/06 18:23:05, David Li <lidav...@apache.org> wrote:
> > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > [2]; I'd appreciate any feedback, particularly from anyone already
> > trying to use OpenTelemetry/Tracing/Census with Arrow.
> >
> > For dependencies: we now use OpenTelemetry as header-only by default.
> > I also slimmed down the build, avoiding making the build wait on
> > OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter
> > that can be toggled via an environment variable.
> >
> > For Python: the PR includes basic integration with Flight/Python.
> > The C++ side will start a span, then propagate it to Python.
> > Spans in Python will not propagate back to C++, and Python and C++
> > each need to set up their respective exporters. I plan to ask the
> > upstream community whether there's a good solution to this kind of
> > issue.
> >
> > For ABI compatibility: this will be an issue until upstream reaches
> > 1.0. Even now, there's an unreleased change on their main branch
> > which will break the current PR when it's released. Hopefully they
> > will reach 1.0 within the Arrow 5.0 release cycle; otherwise, we
> > probably want to avoid shipping this until there is a 1.0. I have
> > confirmed that linking an application which itself links
> > OpenTelemetry against Arrow works.
> >
> > As for the overhead: I measured the impact on a dataset scan
> > recording ~900 spans per iteration, and there was no discernible
> > effect on runtime compared to an uninstrumented scan (though again,
> > this is not that many spans).
> >
> > Best,
> > David
> >
> > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > [2]: https://github.com/apache/arrow/pull/10260
> >
> > On 2021/05/01 19:53:45, "David Li" <lidav...@apache.org> wrote:
> > > Thanks everyone for all the comments. Responding to a few things:
> > >
> > > > It seems to me it would be fairly implementation dependent -- so
> > > > each language implementation would choose if it made sense for
> > > > them and then implement the appropriate connection to that
> > > > language's OpenTelemetry ecosystem.
> > >
> > > Agreed - I think the important thing is to agree on using
> > > OpenTelemetry itself so that the various Flight implementations,
> > > for instance, can all contribute compatible trace data. There will
> > > also be details like naming of keys for extra metadata we might
> > > want to attach, or trying to make (some) span names consistent.
> > >
> > > > My main question is: does integrating OpenTracing complicate our
> > > > build procedure? Is it header-only as long as you use the no-op
> > > > tracer?
> > > > Or do you have to build it and link with it nonetheless?
> > >
> > > I need to look into this more and will follow up. I believe we can
> > > use it header-only. It's fairly simple to depend on (and has no
> > > required dependencies), but it is a synchronous build step (you
> > > must build it to have its headers available) - perhaps that could
> > > be resolved upstream, or I am configuring CMake wrongly. Right now,
> > > I've linked in OpenTelemetry to provide a few utilities (e.g.
> > > logging data to stdout as JSON), but that could be split out into a
> > > libarrow_tracing.so if we keep them.
> > >
> > > > Also, are there ABI issues that may complicate integration into
> > > > applications that were compiled against another version of
> > > > OpenTracing?
> > >
> > > Upstream already seems to be considering ABI compatibility.
> > > However, until they reach 1.0, they of course need not keep any
> > > promises, and that is a worry depending on their timeline. As
> > > pointed out already, they are moving quickly, but they are behind
> > > the other languages' OpenTelemetry implementations.
> > >
> > > > I'm not sure what the overhead is when disabled--I think it is
> > > > probably minimal or else it wouldn't be used so widely. But if
> > > > we're not ready to jump right in, we could introduce our own
> > > > @WithSpan annotation which by default is a no-op. To build an
> > > > instrumented Arrow lib, you'd hook it up with a shim.
> > >
> > > I am focusing on C++ here, but of course the other languages come
> > > into play. A similar idea for C++ may be useful if we need to make
> > > OpenTelemetry optional to avoid ABI worries. A branch may also
> > > work, but I'd like to avoid that if possible.
> > >
> > > Best,
> > > David
> > >
> > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > I agree that OpenTelemetry is the future; I have been following
> > > > the observability space off and on, and I knew about OpenTracing;
> > > > I just realized that OpenTelemetry is its successor. [1]
> > > >
> > > > I have found tracing to be a very powerful approach; at one
> > > > point, I did a POC of a trace recorder inside a Java webapp,
> > > > which shed light on some nasty bottlenecks. If integrated
> > > > properly, it can be left on all the time, so it's valuable for
> > > > doing root-cause analysis in production. At least in Java, there
> > > > are already a lot of packages with OpenTelemetry hooks built
> > > > in. [2]
> > > >
> > > > I'm not sure what the overhead is when disabled--I think it is
> > > > probably minimal or else it wouldn't be used so widely. But if
> > > > we're not ready to jump right in, we could introduce our own
> > > > @WithSpan annotation which by default is a no-op. To build an
> > > > instrumented Arrow lib, you'd hook it up with a shim. Or you
> > > > could just maintain a branch with instrumentation for people to
> > > > try out.
> > > >
> > > > [1]: https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > [2]: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > >
> > > > On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote:
> > > > > Dear David,
> > > > >
> > > > > OpenTelemetry tracing is definitely the future; I guess the
> > > > > question is how far down the stack we want to put it. I think
> > > > > it would be useful for Flight and other higher-level modules,
> > > > > and it would be really useful for DataFusion, for example.
> > > > >
> > > > > As for being alpha, I don't think it will stay that way very
> > > > > long; there is a ton of industry momentum behind OpenTelemetry.
> > > > >
> > > > > -Evan
> > > > >
> > > > > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > For Arrow Datasets, I've been working to instrument the
> > > > > > scanner to find bottlenecks. For example, here's a demo
> > > > > > comparing the current async scanner, which doesn't truly read
> > > > > > asynchronously, to one that does; it should be fairly evident
> > > > > > where the bottleneck is:
> > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > >
> > > > > > I'd like to upstream this, but I'd like to run some questions
> > > > > > by everyone first:
> > > > > > - Does this look useful to developers working on other
> > > > > >   sub-projects?
> > > > > > - This uses OpenTelemetry [1], which is still in alpha, so
> > > > > >   are we comfortable with adopting it? Is the overhead
> > > > > >   acceptable?
> > > > > > - Is there anyone using Arrow to build services who would
> > > > > >   find more general integration useful?
> > > > > >
> > > > > > How it works: OpenTelemetry [1] is used to annotate and
> > > > > > record a "span" for operations like reading a single record
> > > > > > batch. The data is saved as JSON, then rendered by some
> > > > > > JavaScript. The branch is at [2].
> > > > > >
> > > > > > As a quick summary, OpenTelemetry implements distributed
> > > > > > tracing, in which a request is tracked as a directed acyclic
> > > > > > graph of spans. A span is just metadata (name, ID, start/end
> > > > > > time, parent span, ...) about an operation (function call,
> > > > > > network request, ...). Typically, it's used in services.
> > > > > > Spans can reference each other across machines, so you can
> > > > > > track a request across multiple services (e.g. finding which
> > > > > > service failed or is unusually slow in a chain of services
> > > > > > that call each other).
> > > > > >
> > > > > > As opposed to a (sampling) profiler, this gives you
> > > > > > application-level metadata, like filenames or S3 download
> > > > > > rates, that you can use in analysis (as in the demo). It's
> > > > > > also something you'd always keep turned on (at least when
> > > > > > running a service). If integrated with Flight, OpenTelemetry
> > > > > > would also give us a performance picture across multiple
> > > > > > machines - speculatively, something like making a request to
> > > > > > a Flight service and being able to trace all the requests it
> > > > > > makes to S3.
> > > > > >
> > > > > > It does have some overhead; you wouldn't annotate every
> > > > > > function in a codebase. This is rather anecdotal, but for the
> > > > > > demo above, there was essentially zero impact on runtime. Of
> > > > > > course, that demo records very little data overall, so it's
> > > > > > not very representative.
> > > > > >
> > > > > > Alternatives:
> > > > > > - Add a simple Span class of our own, and defer Flight until
> > > > > >   later.
> > > > > > - Integrate OpenTelemetry in such a way that it gets compiled
> > > > > >   out if not enabled at build time. This would be messier but
> > > > > >   should alleviate any performance questions.
> > > > > > - Use something like Perfetto [3] or LLVM XRay [4]. They have
> > > > > >   their own caveats (e.g. XRay is LLVM-specific) and aren't
> > > > > >   intended for the multi-machine use case, but would
> > > > > >   otherwise work. I haven't looked into these much, but could
> > > > > >   evaluate them, especially if they seem more fit for purpose
> > > > > >   for use in other Arrow subprojects.
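[To make the first alternative above concrete: a "simple Span class of our own" could look roughly like the following sketch. This is hypothetical illustration code, not the actual Arrow or OpenTelemetry API; the names `Span`, `set_attribute`, and `to_json` are invented here.]

```python
import itertools
import time

_ids = itertools.count(1)

class Span:
    """Minimal span: metadata (name, ID, start/end time, parent) about an
    operation, forming a DAG of spans. Hypothetical sketch only."""

    def __init__(self, name, parent=None):
        self.name = name
        self.span_id = next(_ids)
        self.parent_id = parent.span_id if parent is not None else None
        self.attributes = {}
        self.start_time = None
        self.end_time = None

    def __enter__(self):
        self.start_time = time.monotonic()
        return self

    def __exit__(self, *exc_info):
        self.end_time = time.monotonic()
        return False  # never swallow exceptions

    def set_attribute(self, key, value):
        # Application-level metadata, e.g. a filename or S3 download rate.
        self.attributes[key] = value

    def to_json(self):
        # Spans can be dumped as JSON and rendered by a visualization tool,
        # as in the demo. Only valid after the span has ended.
        return {
            "name": self.name,
            "span_id": self.span_id,
            "parent_id": self.parent_id,
            "attributes": self.attributes,
            "duration_s": self.end_time - self.start_time,
        }

# Usage: nest spans to record a dataset scan and a batch read inside it.
with Span("dataset-scan") as scan:
    scan.set_attribute("filename", "part-0.parquet")
    with Span("read-batch", parent=scan):
        pass  # ... read a record batch here ...
```

The context-manager form mirrors the "annotate an operation" pattern described above; a real integration would also record trace IDs and hand finished spans to an exporter.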
> > > > > >
> > > > > > If people aren't super enthused, I'll most likely go with
> > > > > > adding a custom Span class for Datasets, and defer the
> > > > > > question of whether we should integrate Flight/Datasets with
> > > > > > OpenTelemetry until another use case arises. But recently we
> > > > > > have seen interest in this - so I see this as perhaps a
> > > > > > chance to take care of two problems at once.
> > > > > >
> > > > > > Thanks,
> > > > > > David
> > > > > >
> > > > > > [1]: https://opentelemetry.io/
> > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > [3]: https://perfetto.dev/
> > > > > > [4]: https://llvm.org/docs/XRay.html
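[For readers curious how span context crosses a process boundary, e.g. the C++-to-Python propagation discussed in this thread: OpenTelemetry's default wire format for this is the W3C Trace Context `traceparent` header. The sketch below shows encoding and decoding that header; it is illustrative only, and real code would use the library's propagator APIs rather than hand-rolling this.]

```python
import re

def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Encode a W3C Trace Context 'traceparent' header, version 00:
    version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex)."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"

_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Decode a traceparent header into (trace_id, span_id, sampled).
    The receiving side starts its spans as children of this context."""
    m = _TRACEPARENT_RE.match(header)
    if m is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return (
        int(m.group("trace_id"), 16),
        int(m.group("span_id"), 16),
        bool(int(m.group("flags"), 16) & 0x01),
    )

# A span started on the C++ side could be continued in Python by passing
# this header alongside the call (e.g. in Flight metadata).
header = build_traceparent(trace_id=0xABC, span_id=0x123)
```

Note that this only carries context one way per call; it does not by itself solve the exporter-setup issue mentioned above, where each runtime still needs its own exporter configured.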