[
https://issues.apache.org/jira/browse/CASSANDRA-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142689#comment-14142689
]
Robert Stupp commented on CASSANDRA-7731:
-----------------------------------------
TL;DR: yammer and Codahale produce different values. Both implementations rely
on frequent updates to work as expected and do not behave correctly for
low-frequency {{update()}} calls.
* yammer's percentiles (e.g. 95, 99, 99.9) are wrong if not enough {{update()}}
calls are made (e.g. for idle column families).
* Codahale takes a bit longer to correct the percentiles (about 10 minutes,
which might roughly correspond to "weighted towards 5 minutes"...)
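To illustrate the first point, here is a minimal sketch of the quantile
interpolation used by {{com.yammer.metrics.stats.Snapshot}} (re-implemented
from its documented behavior; the class name {{SnapshotSketch}} is mine, not
the library's): when the reservoir holds only a handful of values, every high
quantile collapses to the maximum of that handful, so an idle histogram keeps
reporting stale percentiles until enough new values arrive.

```java
import java.util.Arrays;

// A sketch of Snapshot-style quantile interpolation, not library code.
public class SnapshotSketch {
    static double getValue(double[] values, double quantile) {
        double[] v = values.clone();
        Arrays.sort(v);
        // Position of the requested quantile in the sorted sample.
        double pos = quantile * (v.length + 1);
        if (pos < 1) return v[0];
        if (pos >= v.length) return v[v.length - 1];
        // Linear interpolation between the two neighboring samples.
        double lower = v[(int) pos - 1];
        double upper = v[(int) pos];
        return lower + (pos - Math.floor(pos)) * (upper - lower);
    }

    public static void main(String[] args) {
        // An "idle" histogram that saw 100 during warm-up and only a few
        // 2-valued updates afterwards: p99.9 still reports 100.
        double[] sparse = {100, 100, 100, 2, 2};
        System.out.println(getValue(sparse, 0.999)); // prints 100.0
    }
}
```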
I've written an artificial test that runs the same workload against yammer
(2.1) and Codahale (3.1). [Test source available
here|https://bitbucket.org/snazy/histogram-test]
The test runs for 20 minutes and creates three histograms. One is updated 10
times per second, one once per second and one every 15 seconds. The value
passed to {{update()}} is 100 during the first 60 seconds and 2 afterwards.
So the histograms differ only in update frequency, not in the values.
Results:
* yammer does not reset the min and max values on the metric/histogram (e.g.
{{com.yammer.metrics.core.Histogram#max}}), but the result of
{{com.yammer.metrics.stats.Snapshot#getValue(1d)}} is corrected. BIG CAVEAT:
yammer only does this if it "gets enough values" (not bound to the documented
5-minute interval but (also?) to the number of updates); this also affects the
95, 99 and 99.9 percentiles
* yammer and Codahale produce different values for the _mean_ (I have not yet
inspected which values are correct)
* Codahale returns the "correct" max value (and updates the other percentiles)
after a certain time; the more {{update()}} calls, the faster the value is
updated
* nodetool (the 2.1 patch) uses {{Histogram.max()}} via
{{JmxReporter.HistogramMBean}}, so the displayed max value is never reset.
{{JmxReporter.HistogramMBean}} does not support a {{getValue(double)}} call.
We could switch to {{get999thPercentile()}} to mitigate the problem.
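The roughly 10-minute correction time is consistent with the forward-decay
weighting behind Codahale's {{ExponentiallyDecayingReservoir}}. A
back-of-the-envelope sketch (my own code, not library code, assuming the
reservoir's documented default alpha = 0.015, which biases it towards the last
five minutes of measurements): an old sample's weight relative to a fresh one
is exp(-alpha * age), and since eviction only happens on {{update()}},
sparsely-updated histograms converge even more slowly than these numbers
suggest.

```java
// My own sketch of forward-decay relative weights, not the actual
// ExponentiallyDecayingReservoir implementation.
public class DecaySketch {
    static final double ALPHA = 0.015; // the reservoir's documented default

    // Weight of a sample recorded ageSeconds ago, relative to a fresh sample.
    static double relativeWeight(double ageSeconds) {
        return Math.exp(-ALPHA * ageSeconds);
    }

    public static void main(String[] args) {
        // Old value-100 samples keep about 41% of a fresh sample's weight
        // after one minute, about 1% after five minutes, and about 0.01%
        // after ten minutes, roughly matching the observed correction time.
        System.out.printf("age  60 s: %.4f%n", relativeWeight(60));
        System.out.printf("age 300 s: %.4f%n", relativeWeight(300));
        System.out.printf("age 600 s: %.6f%n", relativeWeight(600));
    }
}
```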
Unfortunately I do not see a solution for the Cassandra-2.1 branch. I'd rather
propose reverting the patch that displays the max values, since I'm not sold on
the _99.9th percentile_.
For C* 3.0 there are already tickets regarding metrics: CASSANDRA-5657 and
CASSANDRA-6486.
> Get max values for live/tombstone cells per slice
> -------------------------------------------------
>
> Key: CASSANDRA-7731
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7731
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Cyril Scetbon
> Assignee: Robert Stupp
> Priority: Minor
> Fix For: 2.1.1
>
> Attachments: 7731-2.0.txt, 7731-2.1.txt
>
>
> I think you should not say that slice statistics are valid for the [last five
> minutes
> |https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/tools/NodeCmd.java#L955-L956]
> in the CFSTATS command of nodetool. I've read yammer's documentation for
> Histograms and there is no way to force values to expire after x minutes
> except by
> [clearing|http://grepcode.com/file/repo1.maven.org/maven2/com.yammer.metrics/metrics-core/2.1.2/com/yammer/metrics/core/Histogram.java#96]
> it. The only thing I can see is that the last snapshot used to provide the
> median (or whatever you'd use instead) is based on 1028 values.
> I think we should also be able to detect that some requests are accessing a
> lot of live/tombstone cells per query; that's not possible right now without
> activating DEBUG for SliceQueryFilter, for example, and tweaking the
> threshold. Currently, since nodetool cfstats returns the median, we miss it
> if only a small fraction of the queries scan a lot of live/tombstone cells!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)