[
https://issues.apache.org/jira/browse/CASSANDRA-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281549#comment-15281549
]
Chris Lohfink commented on CASSANDRA-11752:
-------------------------------------------
Well, I will prefix this by saying I had nothing to do with any of the
decisions, nor do I have any real say in this. I am just an observer who has
been impacted by the changes a bit.
bq. If I understand your proposal correctly,
Apparently not; I'll elaborate on a few of the ideas I listed above:
- Change the reservoir to use HdrHistogram.
-- This is how most of the metrics community resolves this (on the ML):
https://bitbucket.org/marshallpierce/hdrhistogram-metrics-reservoir. It has the
HdrHistogramResetOnSnapshotReservoir (the same idea could be implemented in EH
too) that would essentially give you the deltas. Unfortunately, when you read
one attribute at a time this causes problems, per Coda Hale's comment when
people asked for the feature in metrics: "Definitely not. Concurrency and reset
operations don't play nicely." (See the first sketch after this list.)
-- A lot of the push for using non-lossy histograms vs. the random sampling
reservoirs (pre-2.2) came up every time someone saw one of Gil Tene's talks for
the first time, so this would make a lot of people happy.
- Adding an exponential decay to the EH.
-- We can add forward decay to the values of the EH buckets, which is actually
pretty trivial to implement (I'd be willing to give this a shot, it also sounds
fun). See the second sketch after this list.
-- This would give the same "recent" view as the ExponentiallyDecayingReservoir
without the randomness that loses outliers.
- Exposing the clear operation on the MBean: if you clear the histogram after
reading it, that would really give you what you're looking for.
-- Pre-2.2 this is how it worked for cfhistograms and such; there were two ways
to read each histogram, one that cleared it and one that did not.
-- Same comment as for the ResetOnSnapshotReservoir above: this is easy to
manage when doing it programmatically, but it fails with things like dumb JMX
readers.
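To make the one-attribute-at-a-time problem concrete, here is a minimal sketch
of a reset-on-snapshot reservoir built directly on HdrHistogram's Recorder (my
own toy code, not the bitbucket library's implementation): a JMX console that
reads p50 and p99 as separate attributes takes two snapshots, and the second
one sees an empty interval.

{code:java}
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

// Toy reset-on-snapshot reservoir: every snapshot drains the recorded
// interval, so two consecutive reads see disjoint data.
public class ResetOnSnapshotSketch {
    // Recorder supports lock-free recording with interval snapshotting
    private final Recorder recorder = new Recorder(3); // 3 significant digits

    public void update(long value) {
        recorder.recordValue(value);
    }

    // Returns values recorded since the previous call, then resets.
    public Histogram snapshot() {
        return recorder.getIntervalHistogram();
    }

    public static void main(String[] args) {
        ResetOnSnapshotSketch r = new ResetOnSnapshotSketch();
        for (long v = 1; v <= 1000; v++) r.update(v);

        // A dumb JMX reader fetches one attribute at a time, which here
        // means one snapshot per attribute:
        long p50 = r.snapshot().getValueAtPercentile(50.0); // ~500, sees all values
        long p99 = r.snapshot().getValueAtPercentile(99.0); // 0, interval is empty!
        System.out.println("p50=" + p50 + " p99=" + p99);
    }
}
{code}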
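And a rough sketch of what forward decay over EH-style buckets could look like;
the bucket offsets, half-life, and rescale interval here are illustrative
assumptions of mine, not Cassandra code:

{code:java}
// Toy forward decay (Cormode et al.) over EstimatedHistogram-style buckets:
// each bucket holds a weighted count, and newer updates get exponentially
// larger weights relative to a landmark time, so old values fade without
// ever being randomly dropped.
public class ForwardDecayBuckets {
    private final long[] offsets;    // bucket upper bounds, as in EH
    private final double[] weights;  // decayed counts instead of raw longs
    private final double alpha;      // decay rate = ln(2) / half-life
    private long landmark;           // landmark time, in seconds

    public ForwardDecayBuckets(long[] offsets, double halfLifeSeconds) {
        this.offsets = offsets;
        this.weights = new double[offsets.length + 1]; // +1 overflow bucket
        this.alpha = Math.log(2) / halfLifeSeconds;
        this.landmark = now();
    }

    private static long now() { return System.currentTimeMillis() / 1000; }

    private int bucketFor(long value) {
        for (int i = 0; i < offsets.length; i++)
            if (value <= offsets[i]) return i;
        return offsets.length; // overflow
    }

    public synchronized void update(long value) {
        weights[bucketFor(value)] += Math.exp(alpha * (now() - landmark));
        // Rescale periodically so weights don't grow to infinity.
        if (now() - landmark > 60) rescale();
    }

    private void rescale() {
        double factor = Math.exp(alpha * (now() - landmark));
        for (int i = 0; i < weights.length; i++) weights[i] /= factor;
        landmark = now();
    }

    // Percentiles work on the weighted counts exactly as on raw ones.
    public synchronized long percentile(double p) {
        double total = 0;
        for (double w : weights) total += w;
        double target = total * p, cum = 0;
        for (int i = 0; i < offsets.length; i++) {
            cum += weights[i];
            if (cum >= target) return offsets[i];
        }
        return offsets[offsets.length - 1]; // target fell in overflow bucket
    }
}
{code}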
bq. If I understand your proposal correctly, you are saying it would be better
to drop all of that, much more powerful (and mathematically sound!) if we did
an out of band export and merge of all of the histograms and create a heatmap.
This would provide better insight into the distribution of values (by showing
the full distribution instead of a handful of percentiles) and allow for
cluster wide aggregation. This could be further augmented by using hue and
saturation to call out latencies for individual nodes or column families. I
think that sounds fantastic, but that is very much not where the industry is
today. Maybe Circonus can do that, but graphite definitely can't.
For what it's worth, that's more or less what OpsCenter does. It still uses
percentiles vs. a heatmap because they are easier to conceptualize, but it
generates the percentiles on merged histograms rather than _averaging_ the
percentile values (which apparently makes some people very, very angry). That's
not helpful here, but we shouldn't necessarily sacrifice the more accurate
mechanism either. There were very loud complaints about how latencies were
reported before, so I was pretty glad to see the change in 2.2. I propose we
provide both. (A small demonstration of merging vs. averaging follows below.)
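To illustrate with made-up numbers why merging beats averaging: take a node
serving 900 requests at 1ms and a node serving 100 requests at 500ms. Averaging
the two per-node p99s reports ~250ms, while the p99 of the merged histogram
correctly reports 500ms, since the slow node contributes 10% of all requests.

{code:java}
import org.HdrHistogram.Histogram;

// Made-up numbers demonstrating merged-histogram percentiles vs. the
// averaging of per-node percentile values.
public class MergeVsAverage {
    public static void main(String[] args) {
        Histogram fastNode = new Histogram(3); // 3 significant digits
        Histogram slowNode = new Histogram(3);

        for (int i = 0; i < 900; i++) fastNode.recordValue(1);   // 1ms requests
        for (int i = 0; i < 100; i++) slowNode.recordValue(500); // 500ms requests

        // Averaging per-node percentiles: (1 + 500) / 2 = 250.5ms
        double avgOfP99 = (fastNode.getValueAtPercentile(99.0)
                         + slowNode.getValueAtPercentile(99.0)) / 2.0;

        // Percentile of the merged histogram: rank 990 of 1000 lands among
        // the 500ms values, so p99 = 500ms
        Histogram merged = new Histogram(3);
        merged.add(fastNode);
        merged.add(slowNode);
        long mergedP99 = merged.getValueAtPercentile(99.0);

        System.out.println("average of p99s = " + avgOfP99);  // 250.5
        System.out.println("p99 of merge    = " + mergedP99); // 500
    }
}
{code}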
> histograms/metrics in 2.2 do not appear recency biased
> ------------------------------------------------------
>
> Key: CASSANDRA-11752
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11752
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Chris Burroughs
> Labels: metrics
> Attachments: boost-metrics.png, c-jconsole-comparison.png,
> c-metrics.png, default-histogram.png
>
>
> In addition to upgrading to metrics3, CASSANDRA-5657 switched to using a
> custom histogram implementation. After upgrading to Cassandra 2.2,
> histograms/timer metrics are now suspiciously flat. To be useful for graphing
> and alerting, metrics need to be biased towards recent events.
> I have attached images that I think illustrate this.
> * The first two are a comparison between latency observed by a C* 2.2
> cluster (µs) showing very flat lines and a client (using metrics 2.2.0, ms)
> showing server performance problems. We can't rule out with total certainty
> that something else isn't the cause (that's why we measure from both the
> client & server), but they very rarely disagree.
> * The 3rd image compares jconsole views of metrics on a 2.2 and a 2.1
> cluster over several minutes. Not a single digit changed on the 2.2 cluster.