Hi,

I’ve been having a problem where 3 neighboring nodes in our cluster have their 
read latencies jump to 9,000-18,000 ms for a few minutes (as reported by 
OpsCenter), then come back down.

We’re running a 6-node cluster on AWS hi1.4xlarge instances, with Cassandra 
reading and writing to 2 RAIDed SSDs.

I’ve added 2 nodes to the struggling part of the cluster, but aside from the 
latency spikes shifting onto the new nodes, it has had no effect. I suspect 
that a single key living on the first stressed node may be getting read 
heavily.
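
To check that theory, I'm planning to sample the keys the application reads 
and tally them up. Something along these lines (just a rough sketch, not 
anything official - it assumes I can dump a sample of read keys, one per 
line, to stdin from application-side logging, and it prints the most 
frequently read keys):

    # hot_keys.py - count sampled read keys from stdin and print the top 10.
    # The key sample itself has to come from app-side logging; that part is
    # an assumption about our setup, not something Cassandra provides here.
    import sys
    from collections import Counter

    counts = Counter(line.strip() for line in sys.stdin if line.strip())
    for key, n in counts.most_common(10):
        print(f"{n:>8}  {key}")

Run as something like: python hot_keys.py < sampled_keys.txt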

The spikes in latency don’t seem to be correlated with an increase in reads. 
The cluster usually handles a maximum of 4,200 reads/sec per node, with writes 
significantly lower at ~200/sec per node. Normally it copes fine with this, 
with read latencies around 3.5-10 ms/read, but once or twice an hour the 
latencies on the 3 nodes shoot through the roof.

The disks aren’t showing serious use, with read and write rates on the SSD 
volume at around 1350 kBps and 3218 kBps, respectively. Each Cassandra process 
is maintaining 1000-1100 open connections, and the GC logs aren’t showing any 
serious GC pauses.

Any ideas on what might be causing this?

Thanks,

Blake
