[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
T Jake Luciani updated CASSANDRA-7402:
--------------------------------------
Attachment: 7402.txt
Patch to add a histogram and meter for reads and writes. These metrics exist
per column family and are rolled up to keyspace level.
For reads, the histogram tracks the heap size of query responses, both per
partition and across partitions (for range queries).
For writes, the histogram tracks the heap size of single mutations (we already
track and warn users on large batches).
The meters track the aggregate heap usage of reads and writes per node. This is
valuable to track because it shows when the aggregate load of in-flight
operations is generating too much garbage at once.
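For illustration, a minimal sketch of how a per-column-family histogram and meter
could be wired up with the Dropwizard/Codahale Metrics API that Cassandra's
metrics layer builds on; the class, metric names, and update hooks below are
assumptions for the example, not the contents of 7402.txt:
{code}
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

// Hypothetical per-column-family holder; names and wiring are illustrative only.
public class ClientRequestSizeMetrics
{
    private final Histogram readResponseSize;   // heap size of one query response
    private final Histogram writeMutationSize;  // heap size of one mutation
    private final Meter readResponseRate;       // aggregate read response bytes/sec
    private final Meter writeRate;              // aggregate write bytes/sec

    public ClientRequestSizeMetrics(MetricRegistry registry, String keyspace, String columnFamily)
    {
        String prefix = keyspace + "." + columnFamily + ".";
        readResponseSize = registry.histogram(prefix + "ReadResponseSize");
        writeMutationSize = registry.histogram(prefix + "WriteMutationSize");
        readResponseRate = registry.meter(prefix + "ReadResponseRate");
        writeRate = registry.meter(prefix + "WriteRate");
    }

    // Called on the read path with the estimated heap size of a response.
    public void markRead(long responseBytes)
    {
        readResponseSize.update(responseBytes);
        readResponseRate.mark(responseBytes);
    }

    // Called on the write path with the estimated heap size of a mutation.
    public void markWrite(long mutationBytes)
    {
        writeMutationSize.update(mutationBytes);
        writeRate.mark(mutationBytes);
    }
}
{code}
Rolling up to the keyspace level could then be done by updating a second,
keyspace-scoped histogram and meter from the same call sites.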
I changed nodetool cfstats to expose these per column family. Most operators
will want to track these stats in their monitoring systems and pick values to
alert on.
{code}
Average read response bytes per query (last five minutes): 620
Maximum read response bytes per query (last five minutes): 620
Total read response rate bytes/sec (past minute): 7836749
Total read response rate bytes/sec (past five minutes): 2027754
Average write bytes per partition (last five minutes): 620
Maximum write bytes per partition (last five minutes): 620
Total write rate bytes/sec (past minute): 2391983
Total write rate bytes/sec (past five minutes): 2940078
{code}
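For reference, a hedged sketch of how the read-side numbers above could be pulled
from the Metrics objects; the class, method, and parameter names are assumptions,
not the actual nodetool cfstats code:
{code}
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.Snapshot;

// Hypothetical rendering of the read-side cfstats lines above. The default
// histogram reservoir is exponentially decaying and biased toward roughly the
// last five minutes, hence the "(last five minutes)" wording.
public final class ReadSizeStatsPrinter
{
    public static void print(Histogram readResponseSize, Meter readResponseRate)
    {
        Snapshot reads = readResponseSize.getSnapshot();
        System.out.printf("Average read response bytes per query (last five minutes): %.0f%n",
                          reads.getMean());
        System.out.printf("Maximum read response bytes per query (last five minutes): %d%n",
                          reads.getMax());
        System.out.printf("Total read response rate bytes/sec (past minute): %.0f%n",
                          readResponseRate.getOneMinuteRate());
        System.out.printf("Total read response rate bytes/sec (past five minutes): %.0f%n",
                          readResponseRate.getFiveMinuteRate());
    }
}
{code}
The write-side lines would follow the same pattern from the write histogram and meter.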
> Add metrics to track memory used by client requests
> ---------------------------------------------------
>
> Key: CASSANDRA-7402
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
> Project: Cassandra
> Issue Type: Improvement
> Reporter: T Jake Luciani
> Assignee: T Jake Luciani
> Labels: ops, performance, stability
> Fix For: 3.0
>
> Attachments: 7402.txt
>
>
> When running a production cluster, one common operational issue is quantifying
> GC pauses caused by ongoing requests.
> Since different queries return varying amounts of data, you can easily get
> yourself into a situation where a couple of bad actors in the system trigger a
> stop-the-world pause. Or, more likely, the aggregate garbage generated on a
> single node across all in-flight requests causes a GC.
> It would be very useful for operators to see how much garbage the system is
> generating while handling in-flight mutations and queries.
> It would also be nice to have a log of the queries that generate the most
> garbage, so operators can track them, and also a histogram.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)