Re: massive spikes in read latency

2014-01-09 Thread Aaron Morton
 The spikes in latency don’t seem to be correlated to an increase in reads. 
 The cluster usually handles a maximum of 4200 reads/sec per node, with 
 writes significantly lower at ~200/sec per node. It is usually fine with 
 this, with read latencies at around 3.5-10 ms/read, but once or twice an 
 hour the latencies on the 3 nodes will shoot through the roof.
Could there be errant requests coming in from the app? e.g. something asking 
for 1,000s of columns?
Or something that hits a row that has a lot of tombstones? 

Take a look at nodetool cfhistograms to see if you have any outlier wide rows. 
The second column, sstables, will tell you how many sstables were touched by 
reads; high numbers, above 4, let you know there are some wide rows out there.

In 2.0 and later 1.2 releases, nodetool cfstats will also include information 
about the number of tombstones touched in a read.
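
For example (a rough sketch; MyKeyspace and MyCF are placeholders for your own 
keyspace and column family, and the exact output layout varies a little between 
versions):

    nodetool cfhistograms MyKeyspace MyCF   # check the SSTables and Row Size columns for outliers
    nodetool cfstats                        # per-CF stats; includes tombstone info in 2.0 / later 1.2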

Hope that helps. 

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 8/01/2014, at 2:15 am, Jason Wee peich...@gmail.com wrote:

 /**
  * Verbs it's okay to drop if the request has been queued longer than the
  * request timeout.  These all correspond to client requests or something
  * triggered by them; we don't want to drop internal messages like bootstrap
  * or repair notifications.
  */
 public static final EnumSet<Verb> DROPPABLE_VERBS = EnumSet.of(Verb.BINARY,
                                                                Verb._TRACE,
                                                                Verb.MUTATION,
                                                                Verb.READ_REPAIR,
                                                                Verb.READ,
                                                                Verb.RANGE_SLICE,
                                                                Verb.PAGED_RANGE,
                                                                Verb.REQUEST_RESPONSE);
 
 
 The short-term solution would probably be to increase the timeout in your yaml 
 file, but I suggest you get the monitoring graphs (internode ping, block I/O) 
 ready so they give a better indication of what the exact problem might be.
 
 Jason
 
 
 On Tue, Jan 7, 2014 at 2:30 AM, Blake Eggleston bl...@shift.com wrote:
 That’s a good point. CPU steal time is very low, but I haven’t observed 
 internode ping times during one of the peaks; I’ll have to check that out. 
 Another thing I’ve noticed is that cassandra starts dropping read messages 
 during the spikes, as reported by tpstats. This indicates that there are too 
 many queries for cassandra to handle. However, as I mentioned earlier, the 
 spikes aren’t correlated to an increase in reads.
 
 On Jan 5, 2014, at 3:28 PM, Blake Eggleston bl...@shift.com wrote:
 
  Hi,
 
  I’ve been having a problem with 3 neighboring nodes in our cluster having 
  their read latencies jump up to 9000ms - 18000ms for a few minutes (as 
  reported by opscenter), then come back down.
 
  We’re running a 6 node cluster, on AWS hi1.4xlarge instances, with 
  cassandra reading and writing to 2 raided ssds.
 
  I’ve added 2 nodes to the struggling part of the cluster, and aside from 
  the latency spikes shifting onto the new nodes, it has had no effect. I 
  suspect that a single key that lives on the first stressed node may be 
  being read from heavily.
 
  The spikes in latency don’t seem to be correlated to an increase in reads. 
  The cluster usually handles a maximum of 4200 reads/sec per node, with 
  writes significantly lower at ~200/sec per node. It is usually fine with 
  this, with read latencies at around 3.5-10 ms/read, but once or twice an 
  hour the latencies on the 3 nodes will shoot through the roof.
 
  The disks aren’t showing serious use, with read and write rates on the ssd 
  volume at around 1350 kBps and 3218 kBps, respectively. Each cassandra 
  process is maintaining 1000-1100 open connections. GC logs aren’t showing 
  any serious gc pauses.
 
  Any ideas on what might be causing this?
 
  Thanks,
 
  Blake
 
 



Re: massive spikes in read latency

2014-01-07 Thread Jason Wee
/**
 * Verbs it's okay to drop if the request has been queued longer than the
 * request timeout.  These all correspond to client requests or something
 * triggered by them; we don't want to drop internal messages like bootstrap
 * or repair notifications.
 */
public static final EnumSet<Verb> DROPPABLE_VERBS = EnumSet.of(Verb.BINARY,
                                                               Verb._TRACE,
                                                               Verb.MUTATION,
                                                               Verb.READ_REPAIR,
                                                               Verb.READ,
                                                               Verb.RANGE_SLICE,
                                                               Verb.PAGED_RANGE,
                                                               Verb.REQUEST_RESPONSE);


The short-term solution would probably be to increase the timeout in your yaml 
file, but I suggest you get the monitoring graphs (internode ping, block I/O) 
ready so they give a better indication of what the exact problem might be.
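
For reference, the relevant knobs live in conf/cassandra.yaml; a minimal sketch, 
assuming a 1.2+ install where the per-operation timeouts are split out (the 
values below are illustrative, not recommendations):

    # conf/cassandra.yaml -- per-operation timeouts, in milliseconds
    read_request_timeout_in_ms: 10000     # single-row reads; raise if READ messages are being dropped
    range_request_timeout_in_ms: 10000    # range slice / paged range reads
    write_request_timeout_in_ms: 10000    # mutations

As Jason says, this only buys time; the graphs are what will point at the real cause.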

Jason


On Tue, Jan 7, 2014 at 2:30 AM, Blake Eggleston bl...@shift.com wrote:

 That’s a good point. CPU steal time is very low, but I haven’t observed 
 internode ping times during one of the peaks; I’ll have to check that out. 
 Another thing I’ve noticed is that cassandra starts dropping read messages 
 during the spikes, as reported by tpstats. This indicates that there are too 
 many queries for cassandra to handle. However, as I mentioned earlier, the 
 spikes aren’t correlated to an increase in reads.

 On Jan 5, 2014, at 3:28 PM, Blake Eggleston bl...@shift.com wrote:

  Hi,
 
  I’ve been having a problem with 3 neighboring nodes in our cluster
 having their read latencies jump up to 9000ms - 18000ms for a few minutes
 (as reported by opscenter), then come back down.
 
  We’re running a 6 node cluster, on AWS hi1.4xlarge instances, with
 cassandra reading and writing to 2 raided ssds.
 
  I’ve added 2 nodes to the struggling part of the cluster, and aside from
 the latency spikes shifting onto the new nodes, it has had no effect. I
 suspect that a single key that lives on the first stressed node may be
 being read from heavily.
 
  The spikes in latency don’t seem to be correlated to an increase in
 reads. The cluster usually handles a maximum of 4200 reads/sec per node,
 with writes significantly lower at ~200/sec per node. It is usually fine
 with this, with read latencies at around 3.5-10 ms/read, but once or twice
 an hour the latencies on the 3 nodes will shoot through the roof.
 
  The disks aren’t showing serious use, with read and write rates on the
 ssd volume at around 1350 kBps and 3218 kBps, respectively. Each cassandra
 process is maintaining 1000-1100 open connections. GC logs aren’t showing
 any serious gc pauses.
 
  Any ideas on what might be causing this?
 
  Thanks,
 
  Blake




Re: massive spikes in read latency

2014-01-06 Thread Jason Wee
Hi, could it be due to a noisy neighbour? Do you have graphs of ping
statistics between nodes?
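
A quick way to collect those numbers, as a rough sketch (assumes shell access on
each node; the peer IP is a placeholder):

    ping -c 20 10.0.0.2    # round-trip times to a peer node, run during a spike
    iostat -c 5            # CPU summary every 5 seconds; the %steal column shows noisy-neighbour pressure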

Jason


On Mon, Jan 6, 2014 at 7:28 AM, Blake Eggleston bl...@shift.com wrote:

 Hi,

 I’ve been having a problem with 3 neighboring nodes in our cluster having
 their read latencies jump up to 9000ms - 18000ms for a few minutes (as
 reported by opscenter), then come back down.

 We’re running a 6 node cluster, on AWS hi1.4xlarge instances, with
 cassandra reading and writing to 2 raided ssds.

 I’ve added 2 nodes to the struggling part of the cluster, and aside from
 the latency spikes shifting onto the new nodes, it has had no effect. I
 suspect that a single key that lives on the first stressed node may be
 being read from heavily.

 The spikes in latency don’t seem to be correlated to an increase in reads.
 The cluster usually handles a maximum of 4200 reads/sec per node, with
 writes significantly lower at ~200/sec per node. It is usually fine with
 this, with read latencies at around 3.5-10 ms/read, but once or twice an
 hour the latencies on the 3 nodes will shoot through the roof.

 The disks aren’t showing serious use, with read and write rates on the ssd
 volume at around 1350 kBps and 3218 kBps, respectively. Each cassandra
 process is maintaining 1000-1100 open connections. GC logs aren’t showing
 any serious gc pauses.

 Any ideas on what might be causing this?

 Thanks,

 Blake


Re: massive spikes in read latency

2014-01-06 Thread Blake Eggleston
That’s a good point. CPU steal time is very low, but I haven’t observed 
internode ping times during one of the peaks; I’ll have to check that out. 
Another thing I’ve noticed is that cassandra starts dropping read messages 
during the spikes, as reported by tpstats. This indicates that there are too many 
queries for cassandra to handle. However, as I mentioned earlier, the spikes 
aren’t correlated to an increase in reads.
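
For anyone following along, a minimal way to watch for those drops during a spike
(assumes the watch utility is available on the nodes):

    watch -n 5 nodetool tpstats    # the dropped-message counters at the end of the output include READ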

On Jan 5, 2014, at 3:28 PM, Blake Eggleston bl...@shift.com wrote:

 Hi,
 
 I’ve been having a problem with 3 neighboring nodes in our cluster having 
 their read latencies jump up to 9000ms - 18000ms for a few minutes (as 
 reported by opscenter), then come back down.
 
 We’re running a 6 node cluster, on AWS hi1.4xlarge instances, with cassandra 
 reading and writing to 2 raided ssds.
 
 I’ve added 2 nodes to the struggling part of the cluster, and aside from the 
 latency spikes shifting onto the new nodes, it has had no effect. I suspect 
 that a single key that lives on the first stressed node may be being read 
 from heavily.
 
 The spikes in latency don’t seem to be correlated to an increase in reads. 
 The cluster usually handles a maximum of 4200 reads/sec per node, with 
 writes significantly lower at ~200/sec per node. It is usually fine with 
 this, with read latencies at around 3.5-10 ms/read, but once or twice an 
 hour the latencies on the 3 nodes will shoot through the roof. 
 
 The disks aren’t showing serious use, with read and write rates on the ssd 
 volume at around 1350 kBps and 3218 kBps, respectively. Each cassandra 
 process is maintaining 1000-1100 open connections. GC logs aren’t showing any 
 serious gc pauses.
 
 Any ideas on what might be causing this?
 
 Thanks,
 
 Blake