[jira] [Assigned] (CASSANDRA-14281) LatencyMetrics performance

Jeff Jirsa (JIRA) Thu, 01 Mar 2018 09:56:30 -0800

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-14281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jeff Jirsa reassigned CASSANDRA-14281:
--------------------------------------

    Assignee: Michael Burman

> LatencyMetrics performance
> --------------------------
>
>                 Key: CASSANDRA-14281
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14281
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Michael Burman
>            Assignee: Michael Burman
>            Priority: Major
>
> Currently, for each write/read/range query/CAS operation touching the CFS, 
> we record a latency metric, which takes a lot of processing time (up to 66% 
> of the total processing time if the update was empty).
> Latencies are recorded using both a dropwizard "Timer" and a "Counter". The 
> latter is used for totalLatency; the former is a decaying metric used for 
> rates and certain percentile metrics. We then replicate all of these CFS 
> writes to the KeyspaceMetrics and globalWriteLatencies.
> For example, for each CFS write we first write to the CFS's metrics, then 
> to the Keyspace's metrics, and finally to globalMetrics. A Timer maintains 
> a Histogram and a Meter and updates both whenever the Timer is updated. The 
> Meter then updates 4 different values (1 minute rate, 5 minute rate, 15 
> minute rate and a counter).
> So for each CFS write we actually do 15 different counter updates, and of 
> course also maintain their state while writing. These operations are very 
> slow when combined.
> A small JMH benchmark doing an update against a single LatencyMetrics with 4 
> threads gives us around 5.2M updates / second. With the current writeLatency 
> metric (having 2 parents) we get only 1.6M updates / second. 
> I'm proposing to replace this with a small circular buffer of HdrHistogram 
> instances. We would maintain a rolling buffer holding the last 15 minutes 
> of histograms (30 seconds per histogram) and update the current bucket on 
> each write. When metrics are requested, we would merge the requested number 
> of buckets into a new histogram and read the results from it. This moves 
> some of the load from writing the metrics to reading them (a much less 
> frequent operation), including for the parent metrics. It also allows us to 
> maintain the current metrics structure, if we wish to do so.
> My prototype of this approach improves the performance to around 13.8M 
> updates/second, almost 9 times faster than the current approach. 
> HdrHistogram is also already included in Cassandra's lib, so there are no 
> new dependencies to add (java-driver also uses it). 
> FUTURE:
> This opens up some possibilities, such as replacing all dropwizard 
> Histograms/Meters with the new approach (to reduce overhead elsewhere in the 
> codebase). It would also allow us to supply downloadable histograms directly 
> from Cassandra, or to store them to disk each time a bucket is filled, if 
> the user wishes to monitor latency history or graph all percentiles. 
> HdrHistogram also provides the ability to "fix" these histograms using pause 
> tracking (for GC pauses, for example), which we currently can't do (as 
> dropwizard histograms can't be merged).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
