[
https://issues.apache.org/jira/browse/CASSANDRA-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280940#comment-15280940
]
Chris Burroughs commented on CASSANDRA-11752:
---------------------------------------------
So the [point|http://metrics.dropwizard.io/3.1.0/] of the metrics library is to
give "insight into what your code does in production". It is integrated into many
projects. Users expect to be able to take those metrics and:
* Draw a [line
graph|http://www.datastax.com/dev/blog/pluggable-metrics-reporting-in-cassandra-2-0-2].
* Alert on values so they know when there are problems with a cluster.
* Use jconsole to inspect beans and determine what is happening Right Now.
I am aware that there are concerns both in implementation and assumptions
(normal distribution) with the metrics library. They have been brought up both
on [this bug tracker|https://issues.apache.org/jira/browse/CASSANDRA-6486] and
other forums. However imperfect, jconsole, line graphs, and threshold-based
alerts are of critical practical use today. All of these require *recent*
data. When my cluster is failing to meet business needs I want to know as soon
as possible.
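To make the recency point concrete, here is a minimal sketch. It is not Cassandra's or the metrics library's actual code: the class names, the 1,000-sample window, and the latency values are all made up for illustration. It compares a p99 computed over every sample ever recorded with a p99 computed over only the most recent samples.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration: a lifetime histogram versus one limited to recent samples. */
final class RecencyDemo {
    /** Returns the value at quantile q (0..1] using the nearest-rank method. */
    static long percentile(Iterable<Long> samples, double q) {
        List<Long> sorted = new ArrayList<>();
        for (long s : samples) sorted.add(s);
        if (sorted.isEmpty()) throw new IllegalArgumentException("no samples");
        Collections.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    /** Keeps only the last `capacity` samples, so old values age out. */
    static final class SlidingWindow {
        private final int capacity;
        private final Deque<Long> window = new ArrayDeque<>();

        SlidingWindow(int capacity) { this.capacity = capacity; }

        void update(long value) {
            if (window.size() == capacity) window.removeFirst();
            window.addLast(value);
        }

        long percentile(double q) { return RecencyDemo.percentile(window, q); }
    }

    public static void main(String[] args) {
        List<Long> lifetime = new ArrayList<>();
        SlidingWindow recent = new SlidingWindow(1_000);
        // A long healthy period: 100,000 requests at 1 ms.
        for (int i = 0; i < 100_000; i++) { lifetime.add(1L); recent.update(1L); }
        // Then an incident: 1,000 requests at 500 ms.
        for (int i = 0; i < 1_000; i++) { lifetime.add(500L); recent.update(500L); }
        System.out.println("lifetime p99 = " + percentile(lifetime, 0.99) + " ms"); // 1 ms
        System.out.println("recent   p99 = " + recent.percentile(0.99) + " ms");   // 500 ms
    }
}
```

In this toy run the lifetime p99 stays at 1 ms throughout the incident, so a graph or threshold alert built on it would show nothing, while the windowed p99 jumps to 500 ms.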
If I understand your proposal correctly, you are saying it would be better to
drop all of that, and much more powerful (and mathematically sound!), if we did
an out-of-band export and merge of all of the histograms and created a heatmap.
This would provide better insight into the distribution of values (by showing
the full distribution instead of a handful of percentiles) and allow for
cluster-wide aggregation. This could be further augmented by using [hue and
saturation|https://docs.joyent.com/public-cloud/d-40-performance/cloud-analytics/use-of-color-in-cloud-analytics]
to call out latencies for individual nodes or column families. I think that
sounds fantastic, but that is very much not where the industry is today. Maybe
Circonus can do that, but graphite definitely can't.
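As a rough sketch of what "export and merge" could mean, assume every node reports counts against the same fixed bucket boundaries. The boundaries, counts, and class name below are hypothetical and do not reflect Cassandra's actual histogram layout. Under that assumption, merging is a per-bucket sum, and percentiles can be read from the merged counts.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of cluster-wide aggregation by summing bucket counts. */
final class HistogramMerge {
    /** Sums per-node bucket counts; all nodes must share one bucket layout. */
    static long[] merge(List<long[]> perNodeCounts) {
        long[] merged = new long[perNodeCounts.get(0).length];
        for (long[] counts : perNodeCounts) {
            if (counts.length != merged.length) {
                throw new IllegalArgumentException("bucket layouts differ");
            }
            for (int i = 0; i < counts.length; i++) merged[i] += counts[i];
        }
        return merged;
    }

    /** Returns the upper bound of the bucket containing quantile q (0..1]. */
    static long percentile(long[] upperBounds, long[] counts, double q) {
        long total = Arrays.stream(counts).sum();
        if (total == 0) throw new IllegalArgumentException("empty histogram");
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) return upperBounds[i];
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        long[] bounds = {1, 10, 100, 1000};   // bucket upper bounds in ms (made up)
        long[] nodeA = {900, 95, 5, 0};       // a healthy node
        long[] nodeB = {100, 100, 300, 500};  // a struggling node
        long[] merged = merge(Arrays.asList(nodeA, nodeB));
        System.out.println("node A p99  = " + percentile(bounds, nodeA, 0.99) + " ms");  // 10 ms
        System.out.println("node B p99  = " + percentile(bounds, nodeB, 0.99) + " ms");  // 1000 ms
        System.out.println("cluster p99 = " + percentile(bounds, merged, 0.99) + " ms"); // 1000 ms
    }
}
```

The cluster-wide p99 here comes from the merged counts, not from averaging the two per-node percentiles (which would give 505 ms, a value that describes neither node nor the cluster).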
And however cool that future sounds, the NEWS entry makes no mention of this as
an intentional fundamental change. Nor does CASSANDRA-5657 discuss the
consequences. Indeed, CASSANDRA-5657 hoped for improved accuracy and went out of
its way to keep JMX functioning!
> histograms/metrics in 2.2 do not appear recency biased
> ------------------------------------------------------
>
> Key: CASSANDRA-11752
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11752
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Chris Burroughs
> Labels: metrics
> Attachments: boost-metrics.png, c-jconsole-comparison.png,
> c-metrics.png, default-histogram.png
>
>
> In addition to upgrading to metrics3, CASSANDRA-5657 switched to using a
> custom histogram implementation. After upgrading to Cassandra 2.2
> histograms/timer metrics are now suspiciously flat. To be useful for
> graphing and alerting, metrics need to be biased towards recent events.
> I have attached images that I think illustrate this.
> * The first two are a comparison between latency observed by a C* 2.2 (us)
> cluster showing very flat lines and a client (using metrics 2.2.0, ms)
> showing server performance problems. We can't rule out with total certainty
> that something else is the cause (that's why we measure from both the
> client & server) but they very rarely disagree.
> * The 3rd image compares jconsole viewing of metrics on a 2.2 and 2.1
> cluster over several minutes. Not a single digit changed on the 2.2 cluster.