To the *NEW* Solr dev list, CC'ing some interested parties...

At work I've been exploring distributed tracing with my colleagues. Visualizing Solr interactions in the context of a bigger service is simply amazing, leading to faster/deeper insights than logging, IMO. Thankfully, Solr 8.2 has distributed tracing support (credit to Cao Dat in https://issues.apache.org/jira/browse/SOLR-13434 ), with a ready-made plugin based on Jaeger. At work we use Zipkin (via the Brave API); I've never used anything else. We'll likely contribute our Brave/Zipkin "TracerConfigurator" plugin. Despite Solr "having tracing", I consider this just the first step toward more/better tracing.
I'll enumerate some proposed changes/improvements, and simply things to discuss. Remember, I've only used Zipkin:

* Tracing != sampling. You can have a trace that is not sampled! At least in Zipkin, sampling means "reporting" (sending) the trace to a tracing server where it can be stored/analyzed/visualized. The point of a non-sampled trace is propagating IDs for logging (trace ID in MDC). It's pretty fantastic, IMO -- very lightweight. Zipkin has its own samplers. When Solr receives a request with a trace ID, in Zipkin it also includes the binary sampling decision (it's another header). The expectation is that if the trace says to sample, then this sampling decision is propagated downstream, and thus the whole call tree is fully sampled (reported to a server).

** If we embrace tracing by default (traces that aren't sampled), we could re-implement Solr's request ID "rid" -- SOLR-14566 -- in favor of a trivial built-in tracer impl.

** Solr has a "samplePercentage" cluster prop, but I don't like it for multiple reasons. Firstly, its name -- it says nothing about *what* is being sampled! Secondly, I think this feature is best implemented inside the TracerConfigurator plugin abstraction, such that an implementation can choose whether it wants such a thing (maybe the base impl could have it, overridable to be nothing). It's VERY misleading/confusing for a Zipkin user to see a Solr "samplePercentage" feature that does not align with what Zipkin calls sampling. Ideally, the TracerConfigurator plugin implementation can decide whether or not to create a new trace, whether or not to sample (report) it to a server, and whether such settings are dynamic in a cluster property or simply in solr.xml. For example, I'd like to do tracing always, fully report/sample all collection operations, never report/sample frequent & boring API calls like metrics, and choose X% for everything else. A TracerConfigurator should be able to make this choice.
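To make the per-operation sampling idea concrete, here's a minimal sketch of such a policy in plain Java. To be clear, this is NOT Solr's actual TracerConfigurator API -- the class name, method, and request paths below are all illustrative assumptions; the point is only that a plugin could make the report/don't-report decision per request.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical per-request sampling policy, as described above:
 * always report collection operations, never report noisy metrics
 * calls, and report everything else at a configured rate.
 * Names and paths are illustrative, not Solr's real API.
 */
class OperationSampler {
  private final double defaultRate; // 0.0 .. 1.0

  OperationSampler(double defaultRate) {
    this.defaultRate = defaultRate;
  }

  /** Decide whether to report (sample) a trace for the given request path. */
  boolean shouldReport(String path) {
    if (path.startsWith("/admin/collections")) {
      return true;  // always report collection operations
    }
    if (path.startsWith("/admin/metrics")) {
      return false; // never report frequent & boring metrics calls
    }
    // everything else: report X% of the time
    return ThreadLocalRandom.current().nextDouble() < defaultRate;
  }
}
```

Whether the rate lives in a cluster property or solr.xml would then be the implementation's own choice, invisible to the rest of Solr.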
* Solr stores the active Tracer for a request in a ThreadLocal inside GlobalTracer. While that's fine for trivial apps, it isn't for non-trivial multi-threaded servers with thread pools, like Solr. Even OpenTracing's documentation advises that the Tracer be stored somewhere specific to a request. The ideal place is obvious -- SolrRequestInfo. As our project has seen, it takes some work to ensure SolrRequestInfo works in some non-trivial scenarios. By piggy-backing off of this existing mechanism, we can ensure that wherever you can get a SolrRequestInfo in Solr (practically anywhere), you can get the Tracer. This isn't so for GlobalTracer's ThreadLocal. I have some WIP code for this.

* Solr "injects" the trace into HTTP requests for distributed search and indexing. However, creating new spans for the client side of the request was forgotten. OpenTracing has metadata to demarcate client spans -- withTag(Tags.SPAN_KIND, Tags.SPAN_KIND_CLIENT). This is a serious omission for a Zipkin user because any attempt to create a new server RPC span (at SolrDispatchFilter) will "join" into the caller's span -- thus there is only one span no matter how many times Solr calls itself. This visualizes as one big boring span bar with only some annotations along it. I was able to turn this off via a supportsJoin=false option in Brave, but ideally Solr would have client spans. I have some WIP code to rectify this by customizing InstrumentedHttpListenerFactory to add a client span there and to move the injection spot there, removing it from those other places.

* Spans for other server-to-server interaction. Communication with the Overseer via ZK queues ought to "inject" and "extract" traces. Perhaps communication with ZK itself ought to be traced, or at least annotated/logged on the span at a minimum (a nice feature of tracing).

* Internal spans, and adding metadata to spans. There's a lot that can be done here on indexing and search.
Mike Drob presented some of this at a Lucene/Solr Revolution years ago... not sure what became of that. My colleagues have done some of this as well, but it's just a POC/hack at the moment.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
