[ https://issues.apache.org/jira/browse/CASSANDRA-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Burroughs updated CASSANDRA-8035:
---------------------------------------
    Description: 
Running repair causes a significant increase in client latency even when the total amount of data per node is very small.

Each node handles about 900 req/s, and during normal operation the 99p Client Request Latency is less than 4 ms and usually less than 1 ms. During repair the latency increases to 4-10 ms on all nodes. I am unable to find any resource-based explanation for this. Several graphs are attached to summarize. Repair started at about 10:10 and finished around 10:25.
 * Client Request Latency goes up significantly.
 * Local keyspace read latency is flat. I interpret this to mean that it's purely coordinator overhead that's causing the slowdown (see the sketch below for one way to check this from the command line).
 * Row cache hit rate is unaffected (and is very high). Between these two metrics I don't think there is any doubt that virtually all reads are being satisfied in memory.
 * There is plenty of available CPU. Aggregate CPU used (mostly nice) did go up during this.

Having more/larger keyspaces seems to make it worse. Having two keyspaces on this cluster (still with total size << RAM) caused larger increases in latency, which would have made for better graphs, but it pushed the cluster well outside of SLAs and we needed to move the second keyspace.
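Not part of the original report, but a minimal sketch of how the coordinator-overhead interpretation could be double-checked from the command line: start the repair and periodically sample {{nodetool proxyhistograms}} (coordinator-level latency, roughly what clients see) against {{nodetool cfhistograms}} (replica-local read latency). The host, keyspace, and table names below are placeholders, not values from this ticket.

{code}
#!/usr/bin/env python
# Hypothetical sketch: sample coordinator vs. replica-local read latency
# while a repair runs. HOST/KEYSPACE/TABLE are placeholders; point them
# at each node and the affected keyspace in turn.
import subprocess
import time

HOST = "localhost"   # assumed: run against each node in turn
KEYSPACE = "ks1"     # placeholder for the affected keyspace
TABLE = "cf1"        # placeholder column family

def nodetool(*args):
    """Run a nodetool subcommand and return its stdout."""
    return subprocess.check_output(["nodetool", "-h", HOST] + list(args),
                                   universal_newlines=True)

# Start the repair in the background so we can sample while it runs.
repair = subprocess.Popen(["nodetool", "-h", HOST, "repair", KEYSPACE])

while repair.poll() is None:
    # Coordinator-level percentiles -- roughly what clients observe.
    print(nodetool("proxyhistograms"))
    # Replica-local read latency for the table.
    print(nodetool("cfhistograms", KEYSPACE, TABLE))
    time.sleep(10)
{code}

If {{proxyhistograms}} climbs during the repair while {{cfhistograms}} stays flat, that matches the attached graphs: the extra time is being spent at the coordinator, not on local reads.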
> 2.0.x repair causes large increase in client latency even for small datasets
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8035
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8035
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: c-2.0.10, 3 nodes per DC. Load < 50 MB
>            Reporter: Chris Burroughs
>         Attachments: cl-latency.png, cpu-idle.png, keyspace-99p.png,
> row-cache-hit-rate.png
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)