[
https://issues.apache.org/jira/browse/CASSANDRA-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Burroughs updated CASSANDRA-8035:
---------------------------------------
Description:
Running repair causes a significant increase in client latency even when the
total amount of data per node is very small.
Each node serves about 900 req/s, and during normal operation the 99p Client
Request Latency is less than 4 ms and usually less than 1 ms. During repair the
latency increases to 4-10 ms on all nodes. I am unable to find any
resource-based explanation for this. Several graphs are attached to summarize.
Repair started at about 10:10 and finished around 10:25.
* Client Request Latency goes up significantly.
* Local keyspace read latency is flat. I interpret this to mean that it's
purely coordinator overhead that's causing the slowdown (see the commands
after this list for how the two latencies were compared).
* Row cache hit rate is unaffected (and is very high). Between these two
metrics I don't think there is any doubt that virtually all reads are being
satisfied in memory.
* There is plenty of available CPU. Aggregate CPU use (mostly NIC) did go up
during repair.
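For anyone trying to line the metrics up: roughly, the coordinator-side and
local latencies above correspond to nodetool's proxy and per-table histograms,
and the row cache hit rate shows up in nodetool info. This is only a sketch of
how to read the same numbers; the keyspace/table names below are placeholders,
not the ones on this cluster.

  nodetool proxyhistograms            # coordinator (client-facing) latency
  nodetool cfhistograms myks mycf     # local read latency (placeholder names)
  nodetool info                       # includes row cache hit rate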
Having more/larger keyspaces seems to make it worse. Having two keyspaces on
this cluster (still with total size << RAM) caused larger increases in latency,
which would have made for better graphs, but it pushed the cluster well outside
of SLAs and we needed to move the second keyspace.
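For context, "repair" here is the stock anti-entropy repair; the exact flags
we ran aren't captured above, so assume something like the default invocation
(keyspace name is a placeholder):

  nodetool repair myks    # assumed: default full repair, flags not recorded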
> 2.0.x repair causes large increase in client latency even for small datasets
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-8035
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8035
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: c-2.0.10, 3 nodes per DC. Load < 50 MB
> Reporter: Chris Burroughs
> Attachments: cl-latency.png, cpu-idle.png, keyspace-99p.png,
> row-cache-hit-rate.png
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)