[ https://issues.apache.org/jira/browse/CASSANDRA-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142689#comment-14142689 ]

Robert Stupp commented on CASSANDRA-7731:
-----------------------------------------

TL;DR: yammer and codahale produce different values - both implementations rely 
on a steady stream of updates to work as expected and do not behave correctly 
for low-frequency {{update()}} calls.
* yammer's percentiles (e.g. 95th, 99th, 99.9th) are wrong if not enough 
{{update()}} calls are made (e.g. on idle column families).
* codahale takes somewhat longer to correct the percentiles (about 10 minutes - 
which may be what "weighted towards the past 5 minutes" amounts to in 
practice...)

I've written an artificial test that exercises yammer (2.1) and codahale (3.1) 
in the same way. [Test source available 
here|https://bitbucket.org/snazy/histogram-test]

The test runs for 20 minutes and creates three histograms. One is updated 10 
times per second, one once per second and one every 15 seconds. The value 
passed to {{update()}} is 100 during the first 60 seconds and 2 afterwards. 
So the histograms differ only in update frequency - not in the values.
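The schedule above can be simulated without either library. The sketch below is a simplified model (not the actual test from the linked repo) of an exponentially decaying reservoir, assuming the documented defaults of both libraries - decay factor alpha = 0.015/s and reservoir size 1028 - with priority sampling (priority = exp(alpha * t) / u, u uniform in (0,1)) and quantiles computed unweighted over the retained values, as yammer 2.x's {{Snapshot}} does:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class ReservoirSketch {
    static final double ALPHA = 0.015; // default decay factor, per second
    static final int SIZE = 1028;      // default reservoir capacity

    // Simulate the test schedule for one histogram: value 100 for the first
    // 60 s, value 2 afterwards, one update every periodSeconds, for 20 min.
    // Returns the requested quantile at the end of the run.
    static double percentile(double periodSeconds, double quantile) {
        Random rnd = new Random(42); // fixed seed for reproducibility
        // min-heap of [priority, value]: the lowest-priority sample is evicted
        PriorityQueue<double[]> reservoir =
                new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        for (double t = 0; t < 20 * 60; t += periodSeconds) {
            double value = t < 60 ? 100 : 2;
            // forward decay: newer samples get exponentially higher priority
            double priority = Math.exp(ALPHA * t) / rnd.nextDouble();
            reservoir.add(new double[] { priority, value });
            if (reservoir.size() > SIZE) reservoir.poll();
        }
        // quantile over retained values, ignoring the weights
        List<Double> values = new ArrayList<>();
        for (double[] s : reservoir) values.add(s[1]);
        Collections.sort(values);
        int idx = (int) Math.ceil(quantile * values.size()) - 1;
        return values.get(Math.max(idx, 0));
    }

    public static void main(String[] args) {
        // 10 updates/s: the reservoir overflows, the old 100s are evicted
        System.out.println(percentile(0.1, 0.999));
        // 1 update per 15 s: only 80 samples, nothing is ever evicted, so the
        // four 100s from the first minute still dominate the upper percentiles
        System.out.println(percentile(15, 0.999));
    }
}
```

With frequent updates the 99.9th percentile has recovered to 2 by the end of the run; with one update every 15 seconds the reservoir never fills, so the stale 100s keep inflating it to 100 - the effect described in the results below.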

Results:
* yammer does not reset the min and max values on the metric/histogram (e.g. 
{{com.yammer.metrics.core.Histogram#max}}) - but the result of 
{{com.yammer.metrics.stats.Snapshot#getValue(1d)}} is corrected. BIG CAVEAT: 
yammer corrects it only if it "gets enough values" (not bound to the documented 
5-minute interval, but (also?) to the number of updates) - this also affects 
the 95th, 99th and 99.9th percentiles
* yammer and codahale produce different values for the _mean_ (I have not yet 
inspected which one is correct)
* codahale returns the "correct" max value (and updates the other percentiles) 
after a certain time - the more {{update()}} calls, the faster the values are 
updated
* nodetool (the 2.1 patch) uses {{Histogram.max()}} via 
{{JmxReporter.HistogramMBean}} - so the displayed max value is never reset. 
{{JmxReporter.HistogramMBean}} does not expose a {{getValue(double)}} call. 
We could switch to {{get999thPercentile()}} to mitigate the problem.
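The roughly 10-minute correction time observed above lines up with the reservoirs' forward-decay weighting. A minimal calculation, assuming the default decay factor alpha = 0.015 per second used by both libraries and a sample weight of exp(alpha * t):

```java
public class DecayMath {
    static final double ALPHA = 0.015; // default decay factor, per second

    // Relative weight of a sample that is ageSeconds old, compared to a
    // sample recorded right now: exp(-ALPHA * age).
    static double relativeWeight(double ageSeconds) {
        return Math.exp(-ALPHA * ageSeconds);
    }

    public static void main(String[] args) {
        System.out.println(Math.log(2) / ALPHA);     // half-life: ~46 s
        System.out.println(relativeWeight(5 * 60));  // ~0.011 after 5 minutes
        System.out.println(relativeWeight(10 * 60)); // ~0.00012 after 10 minutes
    }
}
```

So a sample's influence halves every ~46 seconds, is down to about 1% after the "documented" 5 minutes, and is practically gone after 10 - which would explain why codahale needs on the order of 10 minutes to fully flush a stale max when updates keep arriving.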

Unfortunately I do not see a solution for the cassandra-2.1 branch. I'd rather 
propose to revert the patch that displays the max values, since I'm not sold on 
the _99.9th percentile_ either.

For C* 3.0 there are already tickets regarding metrics: CASSANDRA-5657 and 
CASSANDRA-6486.

> Get max values for live/tombstone cells per slice
> -------------------------------------------------
>
>                 Key: CASSANDRA-7731
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7731
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Cyril Scetbon
>            Assignee: Robert Stupp
>            Priority: Minor
>             Fix For: 2.1.1
>
>         Attachments: 7731-2.0.txt, 7731-2.1.txt
>
>
> I think you should not say that slice statistics are valid for the [last five 
> minutes 
> |https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/tools/NodeCmd.java#L955-L956]
>  in CFSTATS command of nodetool. I've read the documentation from yammer for 
> Histograms and there is no way to force values to expire after x minutes 
> except by 
> [clearing|http://grepcode.com/file/repo1.maven.org/maven2/com.yammer.metrics/metrics-core/2.1.2/com/yammer/metrics/core/Histogram.java#96]
>  it. The only thing I can see is that the last snapshot used to provide the 
> median (or whatever you'd use instead) is based on 1028 values.
> I think we should also be able to detect that some requests are accessing a 
> lot of live/tombstone cells per query; that's not possible for now without 
> activating DEBUG for SliceQueryFilter, for example, and tweaking the 
> threshold. Currently, since nodetool cfstats returns the median, if only a 
> small fraction of the queries scan a lot of live/tombstone cells, we miss it!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
