To the *NEW* Solr dev list, CC'ing some interested parties...

At work I've been exploring distributed tracing with my colleagues. Visualizing Solr interactions in the context of a bigger service is simply amazing, leading to faster/deeper insights than logging, IMO. Thankfully, Solr 8.2 has distributed tracing support (credit to Cao Dat in https://issues.apache.org/jira/browse/SOLR-13434 ), with a ready-made plugin based on Jaeger. At work we use Zipkin (via the Brave API); I've never used anything else. We'll likely contribute our Brave/Zipkin "TracerConfigurator" plugin. Despite Solr "having tracing", I consider this just the first step toward more/better tracing.
I'll enumerate some proposed changes/improvements, and simply things to discuss. Remember, I've only used Zipkin:

* Tracing != sampling. You can have a trace that is not sampled! At least in Zipkin, sampling means "reporting" (sending) the trace to a tracing server where it can be stored/analyzed/visualized. The point of a non-sampled trace is propagating IDs for logging (trace ID in MDC). It's pretty fantastic, IMO -- very lightweight. Zipkin has its own samplers. When Solr receives a request with a trace ID, in Zipkin it also includes the binary sampling decision (it's another header). The expectation is that if the trace says to sample, then this sampling decision is propagated downstream, and thus the whole call tree is fully sampled (reported to a server).

** If we embrace tracing by default (traces that aren't sampled), we could re-implement Solr's request ID "rid" -- SOLR-14566 -- in favor of a trivial built-in tracer impl.

** Solr has a "samplePercentage" cluster prop, but I don't like it for multiple reasons. Firstly, its name -- it says nothing about *what* is being sampled! Secondly, I think this feature is best implemented inside the TracerConfigurator plugin abstraction, such that an implementation can choose whether it wants such a thing (maybe the base impl could have it, overridable to be nothing). It's VERY misleading/confusing for a Zipkin user to see a Solr "samplePercentage" feature that does not align with what Zipkin calls sampling. Ideally, the TracerConfigurator plugin implementation can decide whether or not to create a new trace, whether or not to sample (report) it to a server, and whether such settings are dynamic in a cluster property or simply in solr.xml. For example, I'd like to do tracing always, fully report/sample all collection operations, never report/sample frequent & boring API calls like metrics, and choose X% for everything else. A TracerConfigurator should be able to make this choice.
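To make the per-operation sampling idea concrete, here's a minimal sketch of such a policy in plain Java. To be clear, this is NOT Solr's actual TracerConfigurator API -- the class name, method, and request paths below are all illustrative assumptions; the point is only that a plugin could make the report/don't-report decision per request.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical per-request sampling policy, as described above:
 * always report collection operations, never report noisy metrics
 * calls, and report everything else at a configured rate.
 * Names and paths are illustrative, not Solr's real API.
 */
class OperationSampler {
  private final double defaultRate; // 0.0 .. 1.0

  OperationSampler(double defaultRate) {
    this.defaultRate = defaultRate;
  }

  /** Decide whether to report (sample) a trace for the given request path. */
  boolean shouldReport(String path) {
    if (path.startsWith("/admin/collections")) {
      return true;  // always report collection operations
    }
    if (path.startsWith("/admin/metrics")) {
      return false; // never report frequent & boring metrics calls
    }
    // everything else: report X% of the time
    return ThreadLocalRandom.current().nextDouble() < defaultRate;
  }
}
```

Whether the rate lives in a cluster property or solr.xml would then be the implementation's own choice, invisible to the rest of Solr.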
* Solr stores the active Tracer for a request in a ThreadLocal inside GlobalTracer. While that's fine for trivial apps, it isn't for non-trivial multi-threaded servers with thread pools, like Solr. Even OpenTracing's documentation advises that the Tracer be stored somewhere specific to a request. The ideal place is obvious -- SolrRequestInfo. As our project has seen, it takes some work to ensure SolrRequestInfo works in some non-trivial scenarios. By piggy-backing off of this existing mechanism, we can ensure that wherever you can get a SolrRequestInfo in Solr (practically anywhere), you can get the Tracer. This isn't so for GlobalTracer's ThreadLocal. I have some WIP code for this.

* Solr "injects" the trace into HTTP requests for distributed search and indexing. However, creating new spans for the client side of the request was forgotten. OpenTracing has metadata to demarcate client spans -- withTag(Tags.SPAN_KIND, Tags.SPAN_KIND_CLIENT). This is a serious omission for a Zipkin user because any attempt to create a new server RPC span (at SolrDispatchFilter) will "join" into the caller's span -- thus there is only one span no matter how many times Solr calls itself. This visualizes as one big boring span bar with only some annotations along it. I was able to turn this off via a supportsJoin=false option in Brave, but ideally Solr would have client spans. I have some WIP code to rectify this by customizing InstrumentedHttpListenerFactory to add a client span there and to move the injection spot there, removing it from those other places.

* Spans for other server-to-server interaction. Communication with the Overseer via ZK queues ought to "inject" and "extract" traces. Perhaps communication with ZK itself ought to be traced, or at least annotated/logged on the span at a minimum (a nice feature of tracing).

* Internal spans, and adding metadata to spans. There's a lot that can be done here on indexing and search.
Mike Drob presented some of this at a Lucene/Solr Revolution years ago... not sure what became of that. My colleagues have done some of this as well, but it's just a POC/hack at the moment.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
