[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani updated CASSANDRA-7402:
--------------------------------------
    Attachment: 7402.txt

Patch to add a histogram and a meter for reads and writes.  These metrics are 
tracked per column family and rolled up to keyspace level.
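
To make the shape concrete, here is a minimal sketch of how such a 
per-column-family histogram/meter pair could be wired up with the Dropwizard 
(Codahale) Metrics library that Cassandra's metrics are built on.  The class 
and metric names here are hypothetical, not what the patch registers.

{code}
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

// Hypothetical sketch, not the patch itself: one histogram + meter pair per
// column family; a parent instance at keyspace scope receives the same
// updates, which is what rolls the numbers up to keyspace level.
public class RequestSizeMetrics
{
    private final Histogram readResponseSize;        // heap bytes per read response
    private final Meter readResponseRate;            // aggregate heap bytes/sec
    private final RequestSizeMetrics keyspaceRollup; // null at keyspace scope

    public RequestSizeMetrics(MetricRegistry registry, String scope, RequestSizeMetrics keyspaceRollup)
    {
        readResponseSize = registry.histogram(MetricRegistry.name("ColumnFamily", scope, "ReadResponseSize"));
        readResponseRate = registry.meter(MetricRegistry.name("ColumnFamily", scope, "ReadResponseRate"));
        this.keyspaceRollup = keyspaceRollup;
        // the write-side pair (mutation size histogram + write rate meter) is analogous
    }

    // Record the measured heap size of one read response.
    public void markReadResponse(long heapBytes)
    {
        readResponseSize.update(heapBytes);
        readResponseRate.mark(heapBytes);
        if (keyspaceRollup != null)
            keyspaceRollup.markReadResponse(heapBytes);
    }
}
{code}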

For reads, the histogram tracks the heap size of query responses, both per 
partition and across all partitions of a range query.

For writes, the histogram tracks the heap size of single mutations (we already 
track and warn users about large batches).

The meters track the aggregate heap usage of reads and writes per node. This 
is valuable because it shows when the combined in-flight operations on a node 
are generating too much garbage at once.
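
As a sketch of the update path: one way to obtain the heap size of a response 
or mutation is the jamm MemoryMeter Cassandra already uses for memtable 
accounting.  Whether the patch measures this way or relies on precomputed 
serialized sizes is an assumption here.

{code}
import org.github.jamm.MemoryMeter;

public class RequestSizeUpdater
{
    // jamm must be registered as a -javaagent for instrumentation-based measurement
    private static final MemoryMeter MEMORY_METER = new MemoryMeter();

    // Hypothetical update path: measure the full object graph of a read
    // response and feed the per-CF metrics sketched above.  measureDeep()
    // walks the whole graph and is not free, so precomputed sizes may be
    // preferable on the hot path.
    public static void recordRead(RequestSizeMetrics cfMetrics, Object readResponse)
    {
        long heapBytes = MEMORY_METER.measureDeep(readResponse);
        cfMetrics.markReadResponse(heapBytes);
    }
}
{code}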

I changed nodetool cfstats to expose these per column family.  Most operators 
will want to track these stats in their monitoring systems and pick thresholds 
to alert on.

{code}
                Average read response bytes per query (last five minutes): 620
                Maximum read response bytes per query (last five minutes): 620
                Total read response rate bytes/sec (past minute): 7836749
                Total read response rate bytes/sec (past five minutes): 2027754
                Average write bytes per partition (last five minutes): 620
                Maximum write bytes per partition (last five minutes): 620
                Total write rate bytes/sec (past minute): 2391983
                Total write rate bytes/sec (past five minutes): 2940078
{code}
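
For reference, the labels above map naturally onto the standard Dropwizard 
accessors: the "past minute" / "past five minutes" rates onto a Meter's 
exponentially-weighted moving averages, and the average/maximum lines onto a 
histogram snapshot (whose default reservoir is biased toward roughly the last 
five minutes).  The exact mapping used by the patch is an assumption.

{code}
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.Snapshot;

public class RequestSizeReporter
{
    // Continuing the sketch above: deriving the read-side cfstats lines from
    // the metric objects; the write-side lines would be derived the same way.
    public static void printReadStats(Histogram sizes, Meter rate)
    {
        Snapshot snap = sizes.getSnapshot();
        System.out.printf("Average read response bytes per query (last five minutes): %.0f%n", snap.getMean());
        System.out.printf("Maximum read response bytes per query (last five minutes): %d%n", snap.getMax());
        System.out.printf("Total read response rate bytes/sec (past minute): %.0f%n", rate.getOneMinuteRate());
        System.out.printf("Total read response rate bytes/sec (past five minutes): %.0f%n", rate.getFiveMinuteRate());
    }
}
{code}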

> Add metrics to track memory used by client requests
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7402
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: T Jake Luciani
>            Assignee: T Jake Luciani
>              Labels: ops, performance, stability
>             Fix For: 3.0
>
>         Attachments: 7402.txt
>
>
> When running a production cluster, one common operational issue is quantifying 
> GC pauses caused by ongoing requests.
> Since different queries return varying amounts of data, you can easily get 
> yourself into a situation where a couple of bad actors in the system trigger a 
> stop-the-world pause.  Or, more likely, the aggregate garbage generated on a 
> single node across all in-flight requests causes a GC.
> It would be very useful for operators to see how much garbage the system is 
> generating to handle in-flight mutations and queries. 
> It would also be nice to have a log of the queries that generate the most 
> garbage, so operators can track them, as well as a histogram.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
