https://issues.apache.org/jira/browse/SOLR-15283 "Remove Solr trace sampling; let Tracer configuration/impl decide"

More to come...
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Mar 8, 2021 at 7:05 PM David Smiley <[email protected]> wrote:

> To the *NEW* Solr dev list, CC'ing some interested parties...
>
> At work I've been exploring distributed tracing with my colleagues.
> Visualizing Solr interactions in the context of a bigger service is simply
> amazing, leading to faster/deeper insights than logging, IMO. Thankfully,
> Solr 8.2 (credit to Cao Dat in
> https://issues.apache.org/jira/browse/SOLR-13434) has distributed tracing
> support, with a ready-made plugin based on Jaeger. At work we use Zipkin
> (via the Brave API); I've never used anything else. We'll likely
> contribute our Brave/Zipkin "TracerConfigurator" plugin. Despite Solr
> "having tracing", I consider it just the first step toward more/better
> tracing.
>
> I'll enumerate some proposed changes/improvements, and some things simply
> to discuss. Remember, I've only used Zipkin:
>
> * Tracing != sampling. You can have a trace that is not sampled! At
> least in Zipkin, sampling means "reporting" (sending) the trace to a
> tracing server where it can be stored/analyzed/visualized. The point of a
> non-sampled trace is propagating IDs for logging (trace ID in MDC). It's
> pretty fantastic IMO -- very lightweight. Zipkin has its own samplers.
> When Solr receives a request with a trace ID, in Zipkin it also includes
> the binary sampling decision (it's another header). The expectation is
> that if the trace says to sample, then this sampling decision is
> propagated downstream and thus the whole call tree is fully sampled
> (reported to a server).
>
> ** If we embrace tracing by default (traces that aren't sampled), we could
> re-implement Solr's request ID "rid" (SOLR-14566) in favor of a trivial
> built-in tracer impl.
>
> ** Solr has a "samplePercentage" cluster prop, but I don't like it for
> multiple reasons. Firstly, its name -- it says nothing about *what* is
> being sampled!
> Secondly, I think this feature is best implemented inside the
> TracerConfigurator plugin abstraction, such that an implementation can
> choose whether it wants such a thing (maybe the base impl could have it,
> overridable to be nothing). It's VERY misleading/confusing for a Zipkin
> user to see a Solr "samplePercentage" feature that does not align with
> what Zipkin calls sampling. Ideally, the TracerConfigurator plugin
> implementation can make the decision on whether or not to create a new
> trace, and further whether or not to sample (report) it to a server, and
> whether or not to have such settings be dynamic in a cluster property or
> simply in solr.xml. For example, I'd like to do tracing always, fully
> report/sample all collection operations, never report/sample frequent &
> boring API calls like metrics, and choose X% for everything else.
> A TracerConfigurator should be able to make this choice.
>
> * Solr stores the active Tracer for a request in a ThreadLocal inside
> GlobalTracer. While that's fine for trivial apps, it isn't for
> non-trivial multi-threaded servers with thread pools, like Solr. Even
> OpenTracing's documentation advises that the Tracer be stored somewhere
> specific to a request. The ideal place is obvious -- SolrRequestInfo. As
> our project has seen, it takes some work to ensure SolrRequestInfo works
> in some non-trivial scenarios. By piggy-backing off of this existing
> mechanism, we can ensure that wherever you can get a SolrRequestInfo in
> Solr (practically anywhere), you can get the Tracer. This isn't so for
> GlobalTracer's ThreadLocal. I have some WIP code for this.
>
> * Solr "injects" the trace into HTTP requests for distributed search and
> indexing. However, creating new spans for the client side of those
> requests was overlooked. OpenTracing has metadata to demarcate client
> spans -- withTag(Tags.SPAN_KIND, Tags.SPAN_KIND_CLIENT).
> This is a serious omission for a Zipkin user because any attempt to
> create a new server RPC span (at SolrDispatchFilter) will "join" into the
> caller's span -- thus there is only one span no matter how many times
> Solr calls itself. This visualizes as one big boring span bar with only
> some annotations along the bar. I was able to turn this off via a
> supportsJoin=false option in Brave, but ideally Solr would have client
> spans. I have some WIP code to rectify this by customizing
> InstrumentedHttpListenerFactory to add a client span there and to move
> the injection spot there, removing it from those other places.
>
> * Spans for other server-to-server interaction. Communication with the
> Overseer via ZK queues ought to "inject" and "extract" traces. Perhaps
> communication with ZK itself ought to be traced, or at least
> annotated/logged on the span (a nice feature of tracing).
>
> * Internal spans, and adding metadata to spans. There's a lot that can
> be done here for indexing and search. Mike Drob presented at a
> Lucene/Solr Revolution years ago in which he did some of this... not sure
> what became of that. My colleagues have done some of this as well, but
> it's just a POC/hack at the moment.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
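To illustrate the "tracing != sampling" point in the quoted message: with Zipkin's B3 propagation, the trace ID (X-B3-TraceId) and the sampling decision (X-B3-Sampled) travel as separate HTTP headers, so a request can carry a trace ID for log correlation even when the sampled flag says "don't report". A minimal sketch of that decision logic in plain Java (a Map stands in for HTTP headers; the class and method names are illustrative, not Brave's actual API):

```java
import java.util.Map;

// Models the separation B3/Zipkin makes between "this request is part of a
// trace" (trace ID present) and "this trace should be reported to a tracing
// server" (the sampled flag). Illustrative only; Brave's extractors do this.
class B3Decision {
    final String traceId;   // null => no incoming trace
    final boolean sampled;  // report spans to the tracing server?

    private B3Decision(String traceId, boolean sampled) {
        this.traceId = traceId;
        this.sampled = sampled;
    }

    static B3Decision fromHeaders(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        // "1" means the upstream caller decided this trace is sampled; that
        // decision must be honored and propagated downstream unchanged.
        boolean sampled = "1".equals(headers.get("X-B3-Sampled"));
        return new B3Decision(traceId, sampled);
    }
}
```

Even when sampled is false, the trace ID is still worth putting in the logging MDC, which is what makes unsampled traces nearly free.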
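The kind of policy described in the message (trace always; report all collection operations, never report metrics calls, X% of everything else) could live entirely inside a TracerConfigurator implementation. A hypothetical sketch of just the decision function (the paths, class name, and rate are illustrative assumptions, not actual Solr code):

```java
import java.util.Random;

// Hypothetical per-request reporting policy of the sort a TracerConfigurator
// implementation could own. The Random is injected so tests are predictable.
class ReportingPolicy {
    private final double defaultRate; // e.g. 0.05 == report 5%
    private final Random random;

    ReportingPolicy(double defaultRate, Random random) {
        this.defaultRate = defaultRate;
        this.random = random;
    }

    /** Decide whether to report (sample) the trace for this request path. */
    boolean shouldReport(String path) {
        if (path.startsWith("/admin/collections")) {
            return true;   // collection operations: always report
        }
        if (path.startsWith("/admin/metrics")) {
            return false;  // frequent & boring: never report
        }
        return random.nextDouble() < defaultRate; // X% of everything else
    }
}
```

Note the trace itself would still exist in every case; only the reporting decision varies, keeping the samplePercentage concern out of Solr core.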
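The client-span fix described above amounts to: at the point where Solr sends an internal HTTP request, start a span tagged span.kind=client, inject its context into the outgoing headers, and finish it when the response arrives. A plain-Java sketch of that shape (this is a stand-in, not the OpenTracing API; the real code would use Tracer/Span with withTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT) and tracer.inject):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a client-side RPC span: just the shape of what a
// customized InstrumentedHttpListenerFactory would need to do per request.
class ClientSpan {
    final String traceId;
    final String spanId;
    final Map<String, String> tags = new HashMap<>();

    ClientSpan(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        // Demarcate the client side of the RPC so the receiving server
        // starts a distinct child span instead of "joining" the caller's.
        tags.put("span.kind", "client");
    }

    /** Inject this span's context into outgoing headers (B3-style names). */
    void inject(Map<String, String> headers) {
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
    }
}
```

With distinct client spans, each Solr-to-Solr hop renders as its own bar in the trace view rather than one merged span.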
