https://issues.apache.org/jira/browse/SOLR-15283 "Remove Solr trace sampling; let Tracer configuration/impl decide"

More to come...
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Mar 8, 2021 at 7:05 PM David Smiley <[email protected]> wrote:

> To the *NEW* Solr dev list, CC'ing some interested parties...
>
> At work I've been exploring distributed tracing with my colleagues.
> Visualizing Solr interactions in the context of a bigger service is simply
> amazing, leading to faster/deeper insights than logging, IMO. Thankfully,
> Solr 8.2 (credit to Cao Dat in
> https://issues.apache.org/jira/browse/SOLR-13434) has distributed tracing
> support, with a ready-made plugin based on Jaeger. At work we use Zipkin
> (via the Brave API); I've never used anything else. We'll likely
> contribute our Brave/Zipkin "TracerConfigurator" plugin. Despite Solr
> "having tracing", I consider it just the first step toward more/better
> tracing.
>
> I'll enumerate some proposed changes/improvements, and some things simply
> to discuss. Remember, I've only used Zipkin:
>
> * Tracing != sampling. You can have a trace that is not sampled! At
> least in Zipkin, sampling means "reporting" (sending) the trace to a
> tracing server where it can be stored/analyzed/visualized. The point of a
> non-sampled trace is propagating IDs for logging (trace ID in MDC). It's
> pretty fantastic IMO -- very lightweight. Zipkin has its own samplers.
> When Solr receives a request with a trace ID, in Zipkin it also includes
> the binary sampling decision (it's another header). The expectation is
> that if the trace says to sample, then this sampling decision is
> propagated downstream and thus the whole call tree is fully sampled
> (reported to a server).
>
> ** If we embrace tracing by default (traces that aren't sampled), we could
> re-implement Solr's request ID "rid" (SOLR-14566) in favor of a trivial
> built-in tracer impl.
>
> ** Solr has a "samplePercentage" cluster prop, but I don't like it for
> multiple reasons. Firstly, its name -- it says nothing about *what* is
> being sampled!
> Secondly, I think this feature is best implemented inside the
> TracerConfigurator plugin abstraction, such that an implementation can
> choose whether it wants such a thing (maybe the base impl could have it,
> overridable to be nothing). It's VERY misleading/confusing for a Zipkin
> user to see a Solr "samplePercentage" feature that does not align with
> what Zipkin calls sampling. Ideally, the TracerConfigurator plugin
> implementation can make the decision on whether or not to create a new
> trace, and further whether or not to sample (report) it to a server, and
> whether or not to have such settings be dynamic in a cluster property or
> simply in solr.xml. For example, I'd like to do tracing always, fully
> report/sample all collection operations, never report/sample frequent &
> boring API calls like metrics, and choose X% for everything else.
> A TracerConfigurator should be able to make this choice.
>
> * Solr stores the active Tracer for a request in a ThreadLocal inside
> GlobalTracer. While that's fine for trivial apps, it isn't for
> non-trivial multi-threaded servers with thread pools, like Solr. Even
> OpenTracing's documentation advises that the Tracer be stored somewhere
> specific to a request. The ideal place is obvious -- SolrRequestInfo. As
> our project has seen, it takes some work to ensure SolrRequestInfo works
> in some non-trivial scenarios. By piggy-backing off of this existing
> mechanism, we can ensure that wherever you can get a SolrRequestInfo in
> Solr (practically anywhere), you can get the Tracer. This isn't so for
> GlobalTracer's ThreadLocal. I have some WIP code for this.
>
> * Solr "injects" the trace into HTTP requests for distributed search and
> indexing. However, creating new spans for the client side of those
> requests was overlooked. OpenTracing has metadata to demarcate client
> spans -- withTag(Tags.SPAN_KIND, Tags.SPAN_KIND_CLIENT).
> This is a serious omission for a Zipkin user because any attempt to
> create a new server RPC span (at SolrDispatchFilter) will "join" into the
> caller's span -- thus there is only one span no matter how many times
> Solr calls itself. This visualizes as one big boring span bar with only
> some annotations along the bar. I was able to turn this off via a
> supportsJoin=false option in Brave, but ideally Solr would have client
> spans. I have some WIP code to rectify this by customizing
> InstrumentedHttpListenerFactory to add a client span there and to move
> the injection spot there, removing it from those other places.
>
> * Spans for other server-to-server interaction. Communication with the
> Overseer via ZK queues ought to "inject" and "extract" traces. Perhaps
> communication with ZK itself ought to be traced, or at least
> annotated/logged on the span (a nice feature of tracing).
>
> * Internal spans, and adding metadata to spans. There's a lot that can
> be done here for indexing and search. Mike Drob presented at a
> Lucene/Solr Revolution years ago in which he did some of this... not sure
> what became of that. My colleagues have done some of this as well, but
> it's just a POC/hack at the moment.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
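To illustrate the "tracing != sampling" point in the quoted message: with Zipkin's B3 propagation, the trace ID (X-B3-TraceId) and the sampling decision (X-B3-Sampled) travel as separate HTTP headers, so a request can carry a trace ID for log correlation even when the sampled flag says "don't report". A minimal sketch of that decision logic in plain Java (a Map stands in for HTTP headers; the class and method names are illustrative, not Brave's actual API):

```java
import java.util.Map;

// Models the separation B3/Zipkin makes between "this request is part of a
// trace" (trace ID present) and "this trace should be reported to a tracing
// server" (the sampled flag). Illustrative only; Brave's extractors do this.
class B3Decision {
    final String traceId;   // null => no incoming trace
    final boolean sampled;  // report spans to the tracing server?

    private B3Decision(String traceId, boolean sampled) {
        this.traceId = traceId;
        this.sampled = sampled;
    }

    static B3Decision fromHeaders(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        // "1" means the upstream caller decided this trace is sampled; that
        // decision must be honored and propagated downstream unchanged.
        boolean sampled = "1".equals(headers.get("X-B3-Sampled"));
        return new B3Decision(traceId, sampled);
    }
}
```

Even when sampled is false, the trace ID is still worth putting in the logging MDC, which is what makes unsampled traces nearly free.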
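The kind of policy described in the message (trace always; report all collection operations, never report metrics calls, X% of everything else) could live entirely inside a TracerConfigurator implementation. A hypothetical sketch of just the decision function (the paths, class name, and rate are illustrative assumptions, not actual Solr code):

```java
import java.util.Random;

// Hypothetical per-request reporting policy of the sort a TracerConfigurator
// implementation could own. The Random is injected so tests are predictable.
class ReportingPolicy {
    private final double defaultRate; // e.g. 0.05 == report 5%
    private final Random random;

    ReportingPolicy(double defaultRate, Random random) {
        this.defaultRate = defaultRate;
        this.random = random;
    }

    /** Decide whether to report (sample) the trace for this request path. */
    boolean shouldReport(String path) {
        if (path.startsWith("/admin/collections")) {
            return true;   // collection operations: always report
        }
        if (path.startsWith("/admin/metrics")) {
            return false;  // frequent & boring: never report
        }
        return random.nextDouble() < defaultRate; // X% of everything else
    }
}
```

Note the trace itself would still exist in every case; only the reporting decision varies, keeping the samplePercentage concern out of Solr core.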
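The client-span fix described above amounts to: at the point where Solr sends an internal HTTP request, start a span tagged span.kind=client, inject its context into the outgoing headers, and finish it when the response arrives. A plain-Java sketch of that shape (this is a stand-in, not the OpenTracing API; the real code would use Tracer/Span with withTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT) and tracer.inject):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a client-side RPC span: just the shape of what a
// customized InstrumentedHttpListenerFactory would need to do per request.
class ClientSpan {
    final String traceId;
    final String spanId;
    final Map<String, String> tags = new HashMap<>();

    ClientSpan(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        // Demarcate the client side of the RPC so the receiving server
        // starts a distinct child span instead of "joining" the caller's.
        tags.put("span.kind", "client");
    }

    /** Inject this span's context into outgoing headers (B3-style names). */
    void inject(Map<String, String> headers) {
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
    }
}
```

With distinct client spans, each Solr-to-Solr hop renders as its own bar in the trace view rather than one merged span.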
