[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani updated CASSANDRA-7402:
--------------------------------------
    Description: 
When running a production cluster one common operational issue is quantifying 
GC pauses caused by ongoing requests.

Since different queries return varying amounts of data, you can easily get 
yourself into a situation where a couple of bad actors in the system trigger a 
stop-the-world pause.  Or, more likely, the aggregate garbage generated on a 
single node across all in-flight requests causes a GC.

It would be very useful for operators to see how much garbage the system is 
generating to handle in-flight mutations and queries. 

It would also be nice to have a log of the queries that generate the most 
garbage, as well as a histogram, so operators can track this.
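As a sketch of what per-request allocation tracking could build on, HotSpot's com.sun.management.ThreadMXBean exposes getThreadAllocatedBytes, which can be sampled before and after a request runs on a thread. The class and method names below are hypothetical illustrations, not actual Cassandra code.

{code}
import java.lang.management.ManagementFactory;

// Hypothetical sketch: measure heap bytes allocated by the current thread
// while it serves a single request. Relies on HotSpot's
// com.sun.management.ThreadMXBean, which supports per-thread allocation
// counters (enabled by default on HotSpot JVMs).
public class RequestAllocationTracker {
    private static final com.sun.management.ThreadMXBean THREAD_MX =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Runs the request body on the calling thread and returns the
     *  number of bytes it allocated. */
    public static long measure(Runnable request) {
        long tid = Thread.currentThread().getId();
        long before = THREAD_MX.getThreadAllocatedBytes(tid);
        request.run();
        return THREAD_MX.getThreadAllocatedBytes(tid) - before;
    }

    public static void main(String[] args) {
        long bytes = measure(() -> {
            byte[] garbage = new byte[1 << 20]; // simulate ~1 MB of request garbage
            garbage[0] = 1;
        });
        System.out.println("allocated bytes: " + bytes);
    }
}
{code}

Values sampled this way could feed both a per-node metric and a per-query histogram/log, since the measurement happens at request boundaries where the query is known.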


  was:
When running a production cluster one common operational issue is quantifying 
GC pauses caused by ongoing requests.

Since different queries return varying amounts of data, you can easily get 
yourself into a situation where a couple of bad actors in the system trigger a 
stop-the-world pause.  Or, more likely, the aggregate garbage generated on a 
single node across all in-flight requests causes a GC.

We should be able to set a limit on the maximum heap we can allocate to all 
outstanding requests and track the garbage per request to stop this from 
happening.  It should increase a single node's availability substantially.

In the yaml this would be

{code}
total_request_memory_space_mb: 400
{code}

It would also be nice to have a log of the queries that generate the most 
garbage, as well as a histogram, so operators can track this.



> Add metrics to track memory used by client requests
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7402
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: T Jake Luciani
>            Assignee: T Jake Luciani
>              Labels: ops, performance, stability
>             Fix For: 3.0
>
>
> When running a production cluster one common operational issue is quantifying 
> GC pauses caused by ongoing requests.
> Since different queries return varying amounts of data, you can easily get 
> yourself into a situation where a couple of bad actors in the system trigger a 
> stop-the-world pause.  Or, more likely, the aggregate garbage generated on a 
> single node across all in-flight requests causes a GC.
> It would be very useful for operators to see how much garbage the system is 
> generating to handle in-flight mutations and queries. 
> It would also be nice to have a log of the queries that generate the most 
> garbage, as well as a histogram, so operators can track this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
