[
https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
T Jake Luciani updated CASSANDRA-7402:
--------------------------------------
Description:
When running a production cluster, one common operational issue is quantifying
GC pauses caused by ongoing requests.
Since different queries return varying amounts of data, you can easily get
yourself into a situation where a couple of bad actors in the system trigger a
stop-the-world pause. More likely, the aggregate garbage generated on a single
node across all in-flight requests causes a GC.
It would be very useful for operators to see how much garbage the system is
generating to handle in-flight mutations and queries.
It would also be nice to have a log of the queries that generate the most
garbage, and a histogram, so operators can track this.
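One way this could be measured, as a rough sketch only: HotSpot exposes per-thread
allocation counters via com.sun.management.ThreadMXBean, so a request's allocation can
be sampled before and after it runs and fed into a histogram. The class name, metric
name and use of the Codahale Metrics Histogram below are assumptions, as is the premise
that a request executes entirely on a single thread:
{code}
import java.lang.management.ManagementFactory;
import java.util.concurrent.Callable;

import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;

// Hypothetical helper: records how many bytes a request allocates on its thread
// and feeds the value into a histogram, along the lines of the metric proposed here.
public class RequestAllocationTracker
{
    // HotSpot-specific extension of the standard ThreadMXBean
    private static final com.sun.management.ThreadMXBean THREAD_MX =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    private final Histogram allocatedBytesPerRequest;

    public RequestAllocationTracker(MetricRegistry registry)
    {
        // Metric name is illustrative, not an existing Cassandra metric
        this.allocatedBytesPerRequest = registry.histogram("ClientRequest.AllocatedBytes");
    }

    // Assumes the request runs entirely on the calling thread
    public <T> T measure(Callable<T> request) throws Exception
    {
        long tid = Thread.currentThread().getId();
        long before = THREAD_MX.getThreadAllocatedBytes(tid);
        try
        {
            return request.call();
        }
        finally
        {
            allocatedBytesPerRequest.update(THREAD_MX.getThreadAllocatedBytes(tid) - before);
        }
    }
}
{code}
The same samples could drive a log of the worst offenders, e.g. by logging any request
whose measured allocation exceeds a threshold.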
was:
When running a production cluster, one common operational issue is quantifying
GC pauses caused by ongoing requests.
Since different queries return varying amounts of data, you can easily get
yourself into a situation where a couple of bad actors in the system trigger a
stop-the-world pause. More likely, the aggregate garbage generated on a single
node across all in-flight requests causes a GC.
We should be able to set a limit on the max heap we can allocate to all
outstanding requests and track the garbage per request to stop this from
happening. It should increase a single node's availability substantially.
In the yaml this would be
{code}
total_request_memory_space_mb: 400
{code}
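A minimal sketch of how such a cap could be accounted for, assuming each request's heap
footprint can be estimated up front; the class and method names are hypothetical and not
part of any existing API:
{code}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical accounting of heap reserved by in-flight client requests,
// bounded by the proposed total_request_memory_space_mb setting.
public class RequestMemoryLimiter
{
    private final long limitBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();

    public RequestMemoryLimiter(long totalRequestMemorySpaceMb)
    {
        this.limitBytes = totalRequestMemorySpaceMb * 1024L * 1024L;
    }

    // Tries to reserve the estimated size of a request; false means the node should shed load
    public boolean tryReserve(long estimatedBytes)
    {
        while (true)
        {
            long current = inFlightBytes.get();
            if (current + estimatedBytes > limitBytes)
                return false;
            if (inFlightBytes.compareAndSet(current, current + estimatedBytes))
                return true;
        }
    }

    // Must be called when the request completes, successfully or not
    public void release(long estimatedBytes)
    {
        inFlightBytes.addAndGet(-estimatedBytes);
    }
}
{code}
A caller would invoke tryReserve before executing the request and release in a finally
block, rejecting or queueing the request when the reservation fails.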
It would also be nice to have a log of the queries that generate the most
garbage, and a histogram, so operators can track this.
> Add metrics to track memory used by client requests
> ---------------------------------------------------
>
> Key: CASSANDRA-7402
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
> Project: Cassandra
> Issue Type: Improvement
> Reporter: T Jake Luciani
> Assignee: T Jake Luciani
> Labels: ops, performance, stability
> Fix For: 3.0
>
>
> When running a production cluster, one common operational issue is quantifying
> GC pauses caused by ongoing requests.
> Since different queries return varying amounts of data, you can easily get
> yourself into a situation where a couple of bad actors in the system trigger a
> stop-the-world pause. More likely, the aggregate garbage generated on a single
> node across all in-flight requests causes a GC.
> It would be very useful for operators to see how much garbage the system is
> generating to handle in-flight mutations and queries.
> It would also be nice to have a log of the queries that generate the most
> garbage, and a histogram, so operators can track this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)