FWIW, I tried this out yesterday since I was profiling the execution of the async API reader. It worked great, so +1 from me on that basis. I did struggle to find a good, simple visualization tool, though. Do you have any recommendations on that front?
On Mon, Jun 7, 2021 at 10:50 AM David Li <lidav...@apache.org> wrote:
>
> Just to give an update on where this stands:
>
> Upstream recently released v1.0.0-RC1 and I've updated the PR [1] to use
> it. This contains a few fixes I submitted for the platforms our various
> CI jobs use, as well as an explicit build flag to support header-only
> use - I think this should alleviate any concerns about it adding too
> much to our build. I'm hopeful this means it can make it into 5.0.0, at
> least with minimal functionality.
>
> For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> have a chance to look through the PR and see if there are any places
> where adding tracing may be useful.
>
> I also touched base with upstream about Python/C++ interop [2] - it
> turns out upstream has thought about this before but doesn't have the
> resources to pursue it at the moment, as the idea is to write an
> API-compatible binding of the C++ library for Python (and presumably R,
> Ruby, etc.), which is more work.
>
> Best,
> David
>
> [1]: https://github.com/apache/arrow/pull/10260
> [2]: https://github.com/open-telemetry/community/discussions/734
>
> On 2021/05/06 18:23:05, David Li <lidav...@apache.org> wrote:
> > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > [2]; I'd appreciate any feedback, particularly from anyone already
> > trying to use OpenTelemetry/Tracing/Census with Arrow.
> >
> > For dependencies: we now use OpenTelemetry as header-only by default.
> > I also slimmed down the build, avoiding making the build wait on
> > OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter
> > that can be toggled via an environment variable.
> >
> > For Python: the PR includes basic integration with Flight/Python.
> > The C++ side will start a span, then propagate it to Python.
> > Spans in Python will not propagate back to C++, and Python and C++
> > each need to set up their respective exporters. I plan to ask the
> > upstream community whether there's a good solution to this kind of
> > issue.
> >
> > For ABI compatibility: this will be an issue until upstream reaches
> > 1.0. Even now, there's an unreleased change on their main branch
> > which will break the current PR when it's released. Hopefully they
> > will reach 1.0 within the Arrow 5.0 release cycle; otherwise, we
> > probably want to avoid shipping this until there is a 1.0. I have
> > confirmed that linking an application which itself links
> > OpenTelemetry against Arrow works.
> >
> > As for the overhead: I measured the impact on a dataset scan
> > recording ~900 spans per iteration, and there was no discernible
> > effect on runtime compared to an uninstrumented scan (though again,
> > this is not that many spans).
> >
> > Best,
> > David
> >
> > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > [2]: https://github.com/apache/arrow/pull/10260
> >
> > On 2021/05/01 19:53:45, "David Li" <lidav...@apache.org> wrote:
> > > Thanks everyone for all the comments. Responding to a few things:
> > >
> > > > It seems to me it would be fairly implementation dependent -- so
> > > > each language implementation would choose if it made sense for
> > > > them and then implement the appropriate connection to that
> > > > language's OpenTelemetry ecosystem.
> > >
> > > Agreed - I think the important thing is to agree on using
> > > OpenTelemetry itself so that the various Flight implementations,
> > > for instance, can all contribute compatible trace data. There will
> > > also be details like naming of keys for extra metadata we might
> > > want to attach, or trying to make (some) span names consistent.
> > >
> > > > My main question is: does integrating OpenTracing complicate our
> > > > build procedure? Is it header-only as long as you use the no-op
> > > > tracer?
> > > > Or do you have to build it and link with it nonetheless?
> > >
> > > I need to look into this more and will follow up. I believe we can
> > > use it header-only. It's fairly simple to depend on (and has no
> > > required dependencies), but it is a synchronous build step (you
> > > must build it to have its headers available) - perhaps that could
> > > be resolved upstream, or I am configuring CMake wrongly. Right now,
> > > I've linked in OpenTelemetry to provide a few utilities (e.g.
> > > logging data to stdout as JSON), but that could be split out into a
> > > libarrow_tracing.so if we keep them.
> > >
> > > > Also, are there ABI issues that may complicate integration into
> > > > applications that were compiled against another version of
> > > > OpenTracing?
> > >
> > > Upstream already seems to be considering ABI compatibility.
> > > However, until they reach 1.0, they of course need not keep any
> > > promises, and that is a worry depending on their timeline. As
> > > pointed out already, they are moving quickly, but they are behind
> > > the other languages' OpenTelemetry implementations.
> > >
> > > > I'm not sure what the overhead is when disabled--I think it is
> > > > probably minimal or else it wouldn't be used so widely. But if
> > > > we're not ready to jump right in, we could introduce our own
> > > > @WithSpan annotation which by default is a no-op. To build an
> > > > instrumented Arrow lib, you'd hook it up with a shim.
> > >
> > > I am focusing on C++ here, but of course the other languages come
> > > into play. A similar idea for C++ may be useful if we need to make
> > > OpenTelemetry optional to avoid ABI worries. A branch may also
> > > work, but I'd like to avoid that if possible.
> > >
> > > Best,
> > > David
> > >
> > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > I agree that OpenTelemetry is the future; I have been following
> > > > the observability space off and on, and I knew about OpenTracing;
> > > > I just realized that OpenTelemetry is its successor. [1]
> > > >
> > > > I have found tracing to be a very powerful approach; at one
> > > > point, I did a POC of a trace recorder inside a Java webapp,
> > > > which shed light on some nasty bottlenecks. If integrated
> > > > properly, it can be left on all the time, so it's valuable for
> > > > doing root-cause analysis in production. At least in Java, there
> > > > are already a lot of packages with OpenTelemetry hooks built
> > > > in. [2]
> > > >
> > > > I'm not sure what the overhead is when disabled--I think it is
> > > > probably minimal or else it wouldn't be used so widely. But if
> > > > we're not ready to jump right in, we could introduce our own
> > > > @WithSpan annotation which by default is a no-op. To build an
> > > > instrumented Arrow lib, you'd hook it up with a shim. Or you
> > > > could just maintain a branch with instrumentation for people to
> > > > try out.
> > > >
> > > > [1]: https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > [2]: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > >
> > > > On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote:
> > > > > Dear David,
> > > > >
> > > > > OpenTelemetry tracing is definitely the future; I guess the
> > > > > question is how far down the stack we want to put it. I think
> > > > > it would be useful for Flight and other higher-level modules,
> > > > > and it would be really useful for DataFusion, for example.
> > > > >
> > > > > As for being alpha, I don't think it will stay that way very
> > > > > long; there is a ton of industry momentum behind OpenTelemetry.
> > > > >
> > > > > -Evan
> > > > >
> > > > > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > For Arrow Datasets, I've been working to instrument the
> > > > > > scanner to find bottlenecks. For example, here's a demo
> > > > > > comparing the current async scanner, which doesn't truly read
> > > > > > asynchronously, to one that does; it should be fairly evident
> > > > > > where the bottleneck is:
> > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > >
> > > > > > I'd like to upstream this, but I'd like to run some questions
> > > > > > by everyone first:
> > > > > > - Does this look useful to developers working on other
> > > > > >   sub-projects?
> > > > > > - This uses OpenTelemetry [1], which is still in alpha, so
> > > > > >   are we comfortable with adopting it? Is the overhead
> > > > > >   acceptable?
> > > > > > - Is there anyone using Arrow to build services who would
> > > > > >   find more general integration useful?
> > > > > >
> > > > > > How it works: OpenTelemetry [1] is used to annotate and
> > > > > > record a "span" for operations like reading a single record
> > > > > > batch. The data is saved as JSON, then rendered by some
> > > > > > JavaScript. The branch is at [2].
> > > > > >
> > > > > > As a quick summary, OpenTelemetry implements distributed
> > > > > > tracing, in which a request is tracked as a directed acyclic
> > > > > > graph of spans. A span is just metadata (name, ID, start/end
> > > > > > time, parent span, ...) about an operation (function call,
> > > > > > network request, ...). Typically, it's used in services.
> > > > > > Spans can reference each other across machines, so you can
> > > > > > track a request across multiple services (e.g. finding which
> > > > > > service failed or is unusually slow in a chain of services
> > > > > > that call each other).
> > > > > >
> > > > > > As opposed to a (sampling) profiler, this gives you
> > > > > > application-level metadata, like filenames or S3 download
> > > > > > rates, that you can use in analysis (as in the demo). It's
> > > > > > also something you'd always keep turned on (at least when
> > > > > > running a service). If integrated with Flight, OpenTelemetry
> > > > > > would also give us a performance picture across multiple
> > > > > > machines - speculatively, something like making a request to
> > > > > > a Flight service and being able to trace all the requests it
> > > > > > makes to S3.
> > > > > >
> > > > > > It does have some overhead; you wouldn't annotate every
> > > > > > function in a codebase. This is rather anecdotal, but for the
> > > > > > demo above, there was essentially zero impact on runtime. Of
> > > > > > course, that demo records very little data overall, so it's
> > > > > > not very representative.
> > > > > >
> > > > > > Alternatives:
> > > > > > - Add a simple Span class of our own, and defer Flight until
> > > > > >   later.
> > > > > > - Integrate OpenTelemetry in such a way that it gets compiled
> > > > > >   out if not enabled at build time. This would be messier but
> > > > > >   should alleviate any performance questions.
> > > > > > - Use something like Perfetto [3] or LLVM XRay [4]. They have
> > > > > >   their own caveats (e.g. XRay is LLVM-specific) and aren't
> > > > > >   intended for the multi-machine use case, but would
> > > > > >   otherwise work. I haven't looked into these much, but could
> > > > > >   evaluate them, especially if they seem more fit for purpose
> > > > > >   for use in other Arrow subprojects.
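[To make the first alternative above concrete: a "simple Span class of our own" could look roughly like the following sketch. This is hypothetical illustration code, not the actual Arrow or OpenTelemetry API; the names `Span`, `set_attribute`, and `to_json` are invented here.]

```python
import itertools
import time

_ids = itertools.count(1)

class Span:
    """Minimal span: metadata (name, ID, start/end time, parent) about an
    operation, forming a DAG of spans. Hypothetical sketch only."""

    def __init__(self, name, parent=None):
        self.name = name
        self.span_id = next(_ids)
        self.parent_id = parent.span_id if parent is not None else None
        self.attributes = {}
        self.start_time = None
        self.end_time = None

    def __enter__(self):
        self.start_time = time.monotonic()
        return self

    def __exit__(self, *exc_info):
        self.end_time = time.monotonic()
        return False  # never swallow exceptions

    def set_attribute(self, key, value):
        # Application-level metadata, e.g. a filename or S3 download rate.
        self.attributes[key] = value

    def to_json(self):
        # Spans can be dumped as JSON and rendered by a visualization tool,
        # as in the demo. Only valid after the span has ended.
        return {
            "name": self.name,
            "span_id": self.span_id,
            "parent_id": self.parent_id,
            "attributes": self.attributes,
            "duration_s": self.end_time - self.start_time,
        }

# Usage: nest spans to record a dataset scan and a batch read inside it.
with Span("dataset-scan") as scan:
    scan.set_attribute("filename", "part-0.parquet")
    with Span("read-batch", parent=scan):
        pass  # ... read a record batch here ...
```

The context-manager form mirrors the "annotate an operation" pattern described above; a real integration would also record trace IDs and hand finished spans to an exporter.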
> > > > > >
> > > > > > If people aren't super enthused, I'll most likely go with
> > > > > > adding a custom Span class for Datasets, and defer the
> > > > > > question of whether we should integrate Flight/Datasets with
> > > > > > OpenTelemetry until another use case arises. But recently we
> > > > > > have seen interest in this - so I see this as perhaps a
> > > > > > chance to take care of two problems at once.
> > > > > >
> > > > > > Thanks,
> > > > > > David
> > > > > >
> > > > > > [1]: https://opentelemetry.io/
> > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > [3]: https://perfetto.dev/
> > > > > > [4]: https://llvm.org/docs/XRay.html
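[For readers curious how span context crosses a process boundary, e.g. the C++-to-Python propagation discussed in this thread: OpenTelemetry's default wire format for this is the W3C Trace Context `traceparent` header. The sketch below shows encoding and decoding that header; it is illustrative only, and real code would use the library's propagator APIs rather than hand-rolling this.]

```python
import re

def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Encode a W3C Trace Context 'traceparent' header, version 00:
    version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex)."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"

_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Decode a traceparent header into (trace_id, span_id, sampled).
    The receiving side starts its spans as children of this context."""
    m = _TRACEPARENT_RE.match(header)
    if m is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return (
        int(m.group("trace_id"), 16),
        int(m.group("span_id"), 16),
        bool(int(m.group("flags"), 16) & 0x01),
    )

# A span started on the C++ side could be continued in Python by passing
# this header alongside the call (e.g. in Flight metadata).
header = build_traceparent(trace_id=0xABC, span_id=0x123)
```

Note that this only carries context one way per call; it does not by itself solve the exporter-setup issue mentioned above, where each runtime still needs its own exporter configured.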