Cassandra operation success ratio survey results

2010-09-21 Thread Juho Mäkinen
It's known that compaction hurts node performance enough that the
node might miss some requests. That's why it's important for the
client to handle these situations by retrying the operation against
another working host. We have been storing performance data for every
request we make to our five-node Cassandra production cluster.

We log the retry count and request type into our data warehouse
solution. I've now extracted the data from a 10-day period and
calculated how many retries were needed before a result was obtained.
The following table shows how many times an operation had to be
retried before it completed successfully. The percentages can be read
as probabilities: for example, a request succeeds on the first try
99.933 % of the time.

Total number of operations: 94 682 251 within 10 days.

Retries | Operations | Percentage of total operations
      0 |   94618468 | 99.93263 %
      1 |      56688 |  0.05987 %
      2 |       5018 |  0.00529 %
      3 |       1359 |  0.00144 %
      4 |        111 |  0.00012 %
      5 |         25 |  0.00003 %
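
The percentage column can be recomputed from the raw counts. The
following is a minimal Python sketch, not part of the original
measurement tooling: the function names are made up for illustration,
and the counts are copied from the table. It also derives a cumulative
column, i.e. the share of operations that finished within a given
number of retries.

```python
# Counts copied from the table: retries needed -> number of operations.
TOTAL_OPERATIONS = 94_682_251
RETRY_COUNTS = {0: 94_618_468, 1: 56_688, 2: 5_018, 3: 1_359, 4: 111, 5: 25}


def share_of_total(count, total=TOTAL_OPERATIONS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


def cumulative_success(counts, total=TOTAL_OPERATIONS):
    """Map each retry budget to the percentage of operations done within it."""
    done = 0
    result = {}
    for retries in sorted(counts):
        done += counts[retries]
        result[retries] = share_of_total(done, total)
    return result


if __name__ == "__main__":
    cumulative = cumulative_success(RETRY_COUNTS)
    for retries, count in sorted(RETRY_COUNTS.items()):
        print(
            f"{retries} | {count:>10} | "
            f"{share_of_total(count):.5f} % | {cumulative[retries]:.5f} %"
        )
```

The cumulative column stays below 100 % at five retries because the
listed rows do not add up to the total number of operations.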

There were also a few operations which needed more than five retries,
so preparing to try up to ten times is not a bad idea.

The cluster runs 0.6.5 with RF=3. Each operation is executed until it
succeeds or until it has been retried 10 times, using this PHP wrapper:
http://github.com/dynamoid/cassandra-utilities

Have others found similar results? Please discuss :)

 - Juho Mäkinen


Re: Cassandra operation success ratio survey results

2010-09-21 Thread Morten Wegelbye Nissen

 On 21-09-2010 15:29, Juho Mäkinen wrote:

 It's known that compaction hurts the node performance so that it might
 miss some requests. That's why it's important to handle these
 situations and the client needs to retry the operation into another
 working host. We have been storing performance data from each
 cassandra request which we do into our five node cassandra production
 cluster.

This is not the main topic of this mail, but how does a client
detect the performance issue here? I guess that somehow the client
does not get an answer.


./Morten


Re: Cassandra operation success ratio survey results

2010-09-21 Thread Juho Mäkinen
The standard Thrift PHP client detects the problem through a normal
timeout, which triggers a TException (Thrift exception) indicating
that the request timed out, or the (in)famous "timed out reading 4
bytes from host" error. These errors are caught by my PHP wrapper
[http://github.com/dynamoid/cassandra-utilities], which retries the
operation inside a while loop against another server. The timeout is
set by the user. We currently use one second, but we'll probably
shorten it in the future, and then increase the timeout for the retry
if the request timed out on the first server. We may also vary it by
request type (get_slice takes longer than get_column).
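
The retry loop described above could look roughly like the following
Python sketch. It is not the code of the PHP wrapper: RequestTimeout,
execute_with_retries and all parameters are hypothetical names, and
the linearly growing timeout (timeout_step) is only one assumed way to
give a retry more time than the first attempt.

```python
import random


class RequestTimeout(Exception):
    """Stand-in for the timeout error a client library raises (hypothetical)."""


def execute_with_retries(operation, hosts, max_retries=10,
                         base_timeout=1.0, timeout_step=0.5):
    """Run operation(host, timeout) until it succeeds or retries run out.

    Returns (result, retries_used) so the retry count can be logged.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    candidates = list(hosts)
    random.shuffle(candidates)  # spread first attempts across the cluster
    last_error = None
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        host = candidates[attempt % len(candidates)]  # next host on each retry
        timeout = base_timeout + attempt * timeout_step  # assumed: grow per retry
        try:
            return operation(host, timeout), attempt
        except RequestTimeout as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    def flaky_get(host, timeout):
        # Simulated request: one node never answers in time.
        if host == "node1":
            raise RequestTimeout(f"{host} timed out after {timeout}s")
        return f"value from {host}"

    result, retries = execute_with_retries(flaky_get, ["node1", "node2", "node3"])
    print(result, "after", retries, "retries")
```

Returning the retry count together with the result makes it easy to
log the count per request type, which is the kind of data the survey
is based on.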

 - Juho Mäkinen


Re: Cassandra operation success ratio survey results

2010-09-21 Thread Brandon Williams
On Tue, Sep 21, 2010 at 8:29 AM, Juho Mäkinen juho.maki...@gmail.com wrote:

 It's known that compaction hurts the node performance so that it might
 miss some requests. That's why it's important to handle these
 situations and the client needs to retry the operation into another
 working host. We have been storing performance data from each
 cassandra request which we do into our five node cassandra production
 cluster.


You should enable the dynamic snitch, since it's designed to route around
nodes that are not performing well, due to compaction or other problems:
https://issues.apache.org/jira/browse/CASSANDRA-981

Also, reducing the compaction thread's priority is a good idea:
https://issues.apache.org/jira/browse/CASSANDRA-1181

These are on by default in 0.7, but require additional JVM args in 0.6

-Brandon


Re: Cassandra operation success ratio survey results

2010-09-21 Thread Aaron Morton
Thanks for this, really interesting stuff.

Just to make sure I'm understanding it, this is for PHP clients with
a 1 second timeout and retry is to a different node in the cluster
with the same timeout.

Have you enabled the Dynamic Snitch?
http://www.riptano.com/blog/whats-new-cassandra-065

Aaron

On 22 Sep, 2010, at 01:29 AM, Juho Mäkinen juho.maki...@gmail.com wrote:

 It's known that compaction hurts the node performance so that it might
 miss some requests. That's why it's important to handle these
 situations and the client needs to retry the operation into another
 working host. We have been storing performance data from each
 cassandra request which we do into our five node cassandra production
 cluster.
