[
https://issues.apache.org/jira/browse/CASSANDRA-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281549#comment-15281549
]
Chris Lohfink commented on CASSANDRA-11752:
-------------------------------------------
Well, I will prefix this by saying I had nothing to do with any of the
decisions, nor do I have any real say in this. I am just an observer who has
been impacted by the changes a bit.
bq. If I understand your proposal correctly,
Apparently not; I'll elaborate on a few of the ideas I listed above:
- Change the reservoir to use HdrHistogram.
-- This is how most of the metrics community resolves this (on the ML):
https://bitbucket.org/marshallpierce/hdrhistogram-metrics-reservoir. It has the
HdrHistogramResetOnSnapshotReservoir (the same idea could be implemented in EH
too) that would essentially give you the deltas. Unfortunately, when you read
one attribute at a time this causes problems, per Coda Hale's comment when
people asked for the feature in metrics: "Definitely not. Concurrency and reset
operations don't play nicely." (See the first sketch after this list.)
-- A lot of the push for using non-lossy histograms vs. the random sampling
reservoirs (pre-2.2) came up every time someone saw one of Gil Tene's talks for
the first time, so this would make a lot of people happy.
- Adding an exponential decay to the EH.
-- We can add forward decay to the values of the EH buckets, which is actually
pretty trivial to implement (I'd be willing to give this a shot, it also sounds
fun). See the second sketch after this list.
-- This would give the same "recent" view as the ExponentiallyDecayingReservoir
without the randomness that loses outliers.
- Exposing the clear operation on the MBean: if you clear the histogram after
reading it, that would really give you what you're looking for.
-- Pre-2.2 this is how it worked for cfhistograms and such; there were two ways
to read each histogram, one that cleared it and one that did not.
-- Same comment as for the ResetOnSnapshotReservoir above: this is easy to
manage when doing it programmatically, but it fails with things like dumb JMX
readers.
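To make the one-attribute-at-a-time problem concrete, here is a minimal sketch
of a reset-on-snapshot reservoir built directly on HdrHistogram's Recorder (my
own toy code, not the bitbucket library's implementation): a JMX console that
reads p50 and p99 as separate attributes takes two snapshots, and the second
one sees an empty interval.

{code:java}
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

// Toy reset-on-snapshot reservoir: every snapshot drains the recorded
// interval, so two consecutive reads see disjoint data.
public class ResetOnSnapshotSketch {
    // Recorder supports lock-free recording with interval snapshotting
    private final Recorder recorder = new Recorder(3); // 3 significant digits

    public void update(long value) {
        recorder.recordValue(value);
    }

    // Returns values recorded since the previous call, then resets.
    public Histogram snapshot() {
        return recorder.getIntervalHistogram();
    }

    public static void main(String[] args) {
        ResetOnSnapshotSketch r = new ResetOnSnapshotSketch();
        for (long v = 1; v <= 1000; v++) r.update(v);

        // A dumb JMX reader fetches one attribute at a time, which here
        // means one snapshot per attribute:
        long p50 = r.snapshot().getValueAtPercentile(50.0); // ~500, sees all values
        long p99 = r.snapshot().getValueAtPercentile(99.0); // 0, interval is empty!
        System.out.println("p50=" + p50 + " p99=" + p99);
    }
}
{code}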
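And a rough sketch of what forward decay over EH-style buckets could look like;
the bucket offsets, half-life, and rescale interval here are illustrative
assumptions of mine, not Cassandra code:

{code:java}
// Toy forward decay (Cormode et al.) over EstimatedHistogram-style buckets:
// each bucket holds a weighted count, and newer updates get exponentially
// larger weights relative to a landmark time, so old values fade without
// ever being randomly dropped.
public class ForwardDecayBuckets {
    private final long[] offsets;    // bucket upper bounds, as in EH
    private final double[] weights;  // decayed counts instead of raw longs
    private final double alpha;      // decay rate = ln(2) / half-life
    private long landmark;           // landmark time, in seconds

    public ForwardDecayBuckets(long[] offsets, double halfLifeSeconds) {
        this.offsets = offsets;
        this.weights = new double[offsets.length + 1]; // +1 overflow bucket
        this.alpha = Math.log(2) / halfLifeSeconds;
        this.landmark = now();
    }

    private static long now() { return System.currentTimeMillis() / 1000; }

    private int bucketFor(long value) {
        for (int i = 0; i < offsets.length; i++)
            if (value <= offsets[i]) return i;
        return offsets.length; // overflow
    }

    public synchronized void update(long value) {
        weights[bucketFor(value)] += Math.exp(alpha * (now() - landmark));
        // Rescale periodically so weights don't grow to infinity.
        if (now() - landmark > 60) rescale();
    }

    private void rescale() {
        double factor = Math.exp(alpha * (now() - landmark));
        for (int i = 0; i < weights.length; i++) weights[i] /= factor;
        landmark = now();
    }

    // Percentiles work on the weighted counts exactly as on raw ones.
    public synchronized long percentile(double p) {
        double total = 0;
        for (double w : weights) total += w;
        double target = total * p, cum = 0;
        for (int i = 0; i < offsets.length; i++) {
            cum += weights[i];
            if (cum >= target) return offsets[i];
        }
        return offsets[offsets.length - 1]; // target fell in overflow bucket
    }
}
{code}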
bq. If I understand your proposal correctly, you are saying it would be better
to drop all of that, much more powerful (and mathematically sound!) if we did
an out of band export and merge of all of the histograms and create a heatmap.
This would provide better insight into the distribution of values (by showing
the full distribution instead of a handful of percentiles) and allow for
cluster wide aggregation. This could be further augmented by using hue and
saturation to call out latencies for individual nodes or column families. I
think that sounds fantastic, but that is very much not where the industry is
today. Maybe Circonus can do that, but graphite definitely can't.
For what it's worth, that's more or less what OpsCenter does. It still uses
percentiles vs. a heatmap because they are easier to conceptualize, but it
generates the percentiles on merged histograms rather than _averaging_ the
percentile values (which apparently makes some people very, very angry). That's
not helpful here, but we shouldn't necessarily sacrifice the more accurate
mechanism either. There were very loud complaints about how latencies were
reported before, so I was pretty glad to see the change in 2.2. I propose we
provide both. (A small demonstration of merging vs. averaging follows below.)
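To illustrate with made-up numbers why merging beats averaging: take a node
serving 900 requests at 1ms and a node serving 100 requests at 500ms. Averaging
the two per-node p99s reports ~250ms, while the p99 of the merged histogram
correctly reports 500ms, since the slow node contributes 10% of all requests.

{code:java}
import org.HdrHistogram.Histogram;

// Made-up numbers demonstrating merged-histogram percentiles vs. the
// averaging of per-node percentile values.
public class MergeVsAverage {
    public static void main(String[] args) {
        Histogram fastNode = new Histogram(3); // 3 significant digits
        Histogram slowNode = new Histogram(3);

        for (int i = 0; i < 900; i++) fastNode.recordValue(1);   // 1ms requests
        for (int i = 0; i < 100; i++) slowNode.recordValue(500); // 500ms requests

        // Averaging per-node percentiles: (1 + 500) / 2 = 250.5ms
        double avgOfP99 = (fastNode.getValueAtPercentile(99.0)
                         + slowNode.getValueAtPercentile(99.0)) / 2.0;

        // Percentile of the merged histogram: rank 990 of 1000 lands among
        // the 500ms values, so p99 = 500ms
        Histogram merged = new Histogram(3);
        merged.add(fastNode);
        merged.add(slowNode);
        long mergedP99 = merged.getValueAtPercentile(99.0);

        System.out.println("average of p99s = " + avgOfP99);  // 250.5
        System.out.println("p99 of merge    = " + mergedP99); // 500
    }
}
{code}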
> histograms/metrics in 2.2 do not appear recency biased
> ------------------------------------------------------
>
> Key: CASSANDRA-11752
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11752
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Chris Burroughs
> Labels: metrics
> Attachments: boost-metrics.png, c-jconsole-comparison.png,
> c-metrics.png, default-histogram.png
>
>
> In addition to upgrading to metrics3, CASSANDRA-5657 switched to using a
> custom histogram implementation. After upgrading to Cassandra 2.2,
> histograms/timer metrics are now suspiciously flat. To be useful for graphing
> and alerting, metrics need to be biased towards recent events.
> I have attached images that I think illustrate this.
> * The first two are a comparison between latency observed by a C* 2.2
> cluster (µs) showing very flat lines and a client (using metrics 2.2.0, ms)
> showing server performance problems. We can't rule out with total certainty
> that something else isn't the cause (that's why we measure from both the
> client & server), but they very rarely disagree.
> * The 3rd image compares jconsole views of metrics on a 2.2 and a 2.1
> cluster over several minutes. Not a single digit changed on the 2.2 cluster.