I’m a big fan of the codahale/dropwizard Metrics library, but if you’re really 
serious about your histograms, be aware that even Metrics’s histograms are 
vulnerable to the Coordinated Omission problem.

I haven’t used it yet, but I’ve been itching to try out 
https://bitbucket.org/marshallpierce/hdrhistogram-metrics-reservoir, which 
bridges HdrHistogram into Metrics.
That page also links to some of the reading on the subject, which is pretty 
interesting stuff.
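For a concrete feel for the problem, here is a toy, pure-Java sketch (no Metrics 
or HdrHistogram dependency; the numbers are made up) of what coordinated 
omission does to a percentile. A closed-loop load generator that waits for each 
response records only one sample for a long stall, so the stall barely shows up; 
HdrHistogram’s recordValueWithExpectedInterval compensates by also crediting 
the requests that would have been issued during the stall:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CoordinatedOmissionDemo {
    /** Nearest-rank percentile over a sorted copy of the samples. */
    static long percentile(List<Long> samples, double p) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }

    public static void main(String[] args) {
        long expectedIntervalMs = 10; // we meant to issue one request every 10 ms
        List<Long> naive = new ArrayList<>();
        List<Long> corrected = new ArrayList<>();
        for (int i = 0; i < 999; i++) { // 999 fast 1 ms responses
            naive.add(1L);
            corrected.add(1L);
        }
        naive.add(1000L); // one 1000 ms stall: the naive recorder logs a single sample
        // Corrected recording also credits the ~100 requests that would have
        // been issued (and left queued) during the stall, the way HdrHistogram's
        // recordValueWithExpectedInterval does.
        for (long v = 1000; v > 0; v -= expectedIntervalMs) {
            corrected.add(v);
        }
        System.out.println("naive p99     = " + percentile(naive, 99));     // 1 ms
        System.out.println("corrected p99 = " + percentile(corrected, 99)); // 900 ms
    }
}
```

The naive recorder reports a 1 ms p99 even though the service was stalled for a 
full second of the run.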


From: Walter Underwood <wun...@wunderwood.org>
Reply-To: "dev@lucene.apache.org" <dev@lucene.apache.org>
Date: Tuesday, November 15, 2016 at 8:31 AM
To: "dev@lucene.apache.org" <dev@lucene.apache.org>
Subject: Re: Collection API for performance monitoring?

To calculate percentiles we need all the data points. If there is a lot of 
data, it could be sampled.

The average can be calculated from the total time and the number of requests. 
Snapshots of those two values give snapshots of the average.

But averages are the wrong metric for a one-sided distribution like response 
time. Let’s assume that any response longer than 10 seconds is a bad 
experience. Percentiles will tell you what response time 95% of customer 
searches are getting. With averages, a single 30 second response pulls the 
metric up twice as much as a 15 s response would, even though both are “just 
as broken”.
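A tiny worked example of that difference, with made-up numbers: two runs of 20 
requests, each with exactly one broken (over 10 s) response.

```java
import java.util.Arrays;

public class AverageVsPercentile {
    /** Mean of the response times, in seconds. */
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0);
    }

    /** Fraction of responses slower than the threshold. */
    static double fractionOver(double[] xs, double thresholdSec) {
        return (double) Arrays.stream(xs).filter(x -> x > thresholdSec).count() / xs.length;
    }

    public static void main(String[] args) {
        double[] withThirty = new double[20];
        double[] withFifteen = new double[20];
        Arrays.fill(withThirty, 1.0);
        Arrays.fill(withFifteen, 1.0);
        withThirty[19] = 30.0;  // one 30 s response
        withFifteen[19] = 15.0; // one 15 s response

        // The averages differ (2.45 s vs 1.70 s) even though each run has
        // exactly one broken request...
        System.out.printf("means: %.2f vs %.2f%n",
                mean(withThirty), mean(withFifteen));
        // ...while "fraction of requests over 10 s" is 5% in both.
        System.out.printf("over 10 s: %.2f vs %.2f%n",
                fractionOver(withThirty, 10), fractionOver(withFifteen, 10));
    }
}
```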

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Nov 15, 2016, at 7:27 AM, Ryan Josal <rjo...@gmail.com> wrote:

I haven't tried it for the 95th percentile, but generally with those 
since-collection-start stats you would monitor based on calculated deltas.  You 
can figure out the average response time for any window of time no smaller 
than your snapshot polling interval.  I don't see why the 95th percentile would 
be any different.
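The delta calculation described above can be sketched like this (the field 
names are illustrative, not actual Solr mbean keys):

```java
public class SnapshotDelta {
    /** One poll of cumulative handler counters, e.g. from admin/mbeans. */
    static class Snapshot {
        final long requests;      // requests since the handler was loaded
        final double totalTimeMs; // total request time since the handler was loaded

        Snapshot(long requests, double totalTimeMs) {
            this.requests = requests;
            this.totalTimeMs = totalTimeMs;
        }
    }

    /** Average response time over the window between two polls. */
    static double windowAvgMs(Snapshot earlier, Snapshot later) {
        long deltaRequests = later.requests - earlier.requests;
        if (deltaRequests <= 0) {
            return 0.0; // no traffic (or a handler reload) in this window
        }
        return (later.totalTimeMs - earlier.totalTimeMs) / deltaRequests;
    }

    public static void main(String[] args) {
        Snapshot t0 = new Snapshot(10_000, 250_000);
        Snapshot t1 = new Snapshot(12_000, 310_000); // polled one interval later
        // 2,000 requests took 60,000 ms in this window: 30 ms average
        System.out.println(windowAvgMs(t0, t1));
    }
}
```

This works for averages because totals and counts are additive across windows; 
percentile summaries are not additive the same way, which is why the thread 
keeps coming back to keeping the raw distribution (e.g. a histogram) around.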

Ryan

On Monday, November 14, 2016, Walter Underwood <wun...@wunderwood.org> wrote:
Because the current stats are not usable. They really should be removed from 
the code.

They calculate percentiles since the last collection load. We need to know the 
95th percentile during the peak hour last night, not the 95th percentile for 
the last month.

Right now, we run eleven collections in our Solr 4 cluster. In each collection, 
we have several different handlers: usually one for autosuggest (instant 
results), one for the SRP, and one for mobile, though we also have SEO requests 
and so on. We can track performance for each of these.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Nov 14, 2016, at 3:54 PM, Erick Erickson <erickerick...@gmail.com> wrote:

Point taken, and thanks for the link. The stats I'm referring to in
this thread are available now, and would (I think) be a quick win. I
don't have a huge amount invested in it, though; it's more "why didn't
we think of this before?" followed by "maybe there's a very good
reason not to bother". That may be the reason, since we now standardize
on Jetty. My question, of course, is whether this would be supported
moving forward to Netty or whatever...

Best,
Erick

On Mon, Nov 14, 2016 at 3:44 PM, Walter Underwood <wun...@wunderwood.org> wrote:

I’m not fond of polling for performance stats. I’d rather have the app
report them.

We could integrate existing Jetty monitoring:

http://metrics.dropwizard.io/3.1.0/manual/jetty/

From our experience with a similar approach, we might need some Solr-specific
metric conflation. SolrJ sends a request for /solr/collection/handler as
/solr/collection/select?qt=/handler. In our code, we rewrite that request to
the intended path. We’ve been running a Tomcat metrics search filter for three
years.
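A minimal sketch of that kind of conflation (the method and names here are 
assumptions about what such a filter might do, not the actual filter code): 
fold the qt parameter back into the path before using it as a metric name, so 
both request forms land in the same per-handler metric.

```java
public class MetricPathConflation {
    /**
     * Rewrite the path SolrJ actually sent (/solr/collection/select?qt=/handler)
     * to the handler the caller intended (/solr/collection/handler).
     */
    static String metricPath(String path, String query) {
        if (query != null && path.endsWith("/select")) {
            for (String param : query.split("&")) {
                if (param.startsWith("qt=/")) {
                    String handler = param.substring("qt=".length()); // e.g. "/suggest"
                    return path.substring(0, path.length() - "/select".length()) + handler;
                }
            }
        }
        return path; // no qt override: use the path as-is
    }

    public static void main(String[] args) {
        // Both of these count against the /solr/products/suggest metric.
        System.out.println(metricPath("/solr/products/select", "qt=/suggest&q=shoes"));
        System.out.println(metricPath("/solr/products/suggest", "q=shoes"));
    }
}
```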

Also, see:

https://issues.apache.org/jira/browse/SOLR-8785

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Nov 14, 2016, at 3:25 PM, Erick Erickson <erickerick...@gmail.com> wrote:

What do people think about exposing a Collections API call (name TBD,
but the sense is PERFORMANCESTATS) that would simply issue the
admin/mbeans call to each replica of a collection and report them
back. This would give operations monitors the ability to see, say,
anomalous replicas that had poor average response times for the last 5
minutes and the like.
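As a sketch of the consumer side (the names and the threshold rule are made up 
for illustration): once per-replica averages come back from such a call, 
flagging the outliers is simple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReplicaOutliers {
    /**
     * Flag replicas whose average response time exceeds `factor` times the
     * collection-wide median, given per-replica averages from one poll.
     */
    static List<String> anomalous(Map<String, Double> avgMsByReplica, double factor) {
        double[] sorted = avgMsByReplica.values().stream()
                .mapToDouble(Double::doubleValue).sorted().toArray();
        double median = sorted[sorted.length / 2];
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Double> e : avgMsByReplica.entrySet()) {
            if (e.getValue() > factor * median) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, Double> avgs = Map.of(
                "shard1_replica1", 21.0,
                "shard1_replica2", 23.0,
                "shard1_replica3", 190.0);
        System.out.println(anomalous(avgs, 3.0)); // flags shard1_replica3
    }
}
```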

Seems like an easy enhancement that would make ops people's lives easier.

I'll raise a JIRA if there's interest, but sure won't make progress on
it until I clear my plate of some other JIRAs that I've let linger for
far too long.

Erick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

