[ 
https://issues.apache.org/jira/browse/CASSANDRA-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Burroughs updated CASSANDRA-8035:
---------------------------------------
    Description: 
Running repair causes a significant increase in client latency even when the 
total amount of data per node is very small.

Each node serves about 900 req/s, and during normal operations the 99p Client 
Request Latency is less than 4 ms and usually less than 1 ms.  During repair 
the latency increases to 4-10 ms on all nodes.  I am unable to find any 
resource-based explanation for this.  Several graphs are attached to 
summarize.  Repair started at about 10:10 and finished around 10:25.

 * Client Request Latency goes up significantly.
 * Local keyspace read latency is flat.  I interpret this to mean that it is 
purely coordinator overhead that is causing the slowdown (see the JMX polling 
sketch after this list).
 * Row cache hit rate is unaffected (and is very high).  Between these two 
metrics I don't think there is any doubt that virtually all reads are being 
satisfied in memory.
 * There is plenty of available cpu, although aggregate cpu used (mostly nic) 
did go up during the repair.
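
The coordinator-vs-local comparison above can be reproduced by polling both 
latency timers over JMX while a repair runs.  The following is only a minimal 
sketch: it assumes the default JMX port 7199 and the 2.0.x metrics MBean 
names, and "my_keyspace"/"my_cf" are placeholder identifiers, not names from 
this cluster.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: poll the coordinator (ClientRequest) read latency and the local
// column-family read latency over JMX while a repair runs, to watch the two
// series diverge.  Port and MBean names are assumptions for a 2.0.x node.
public class RepairLatencyProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Coordinator-side read latency: what clients actually observe.
            ObjectName coordinator = new ObjectName(
                "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
            // Replica-local read latency for one column family (placeholder names).
            ObjectName local = new ObjectName(
                "org.apache.cassandra.metrics:type=ColumnFamily,keyspace=my_keyspace,scope=my_cf,name=ReadLatency");

            // Sample both 99th percentiles every 10 seconds across the repair window.
            // Values are printed as exposed by the node's JMX timer attributes.
            while (true) {
                double coord99 = ((Number) mbs.getAttribute(coordinator, "99thPercentile")).doubleValue();
                double local99 = ((Number) mbs.getAttribute(local, "99thPercentile")).doubleValue();
                System.out.printf("coordinator 99p=%.1f  local 99p=%.1f%n", coord99, local99);
                Thread.sleep(10_000);
            }
        } finally {
            connector.close();
        }
    }
}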

Having more/larger keyspaces seems to make it worse.  Having two keyspaces on 
this cluster (still with total size << RAM) caused larger increases in 
latency, which would have made for better graphs, but it pushed the cluster 
well outside of SLAs and we needed to move the second keyspace.


> 2.0.x repair causes large increase in client latency even for small datasets
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8035
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8035
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: c-2.0.10, 3 nodes per DC.  Load < 50 MB
>            Reporter: Chris Burroughs
>         Attachments: cl-latency.png, cpu-idle.png, keyspace-99p.png, 
> row-cache-hit-rate.png
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
