Cassandra operation success ratio survey results
It's known that compaction hurts node performance enough that the node may miss some requests. That's why it's important for the client to handle these situations by retrying the operation against another working host.

We have been storing performance data for every Cassandra request we make against our five-node production cluster. We log the retry count and request type into our data warehouse, and I've now extracted the data from a 10-day period and calculated how many retries were needed before each operation returned a result.

The following table shows how many times an operation had to be retried before it completed successfully. Each percentage is the share of all operations; for example, 99.933 % of requests succeeded on the first try.

Total number of operations: 94 682 251 within 10 days.

Retry times | operations | percentage of total operations
0 | 94618468 | 99.93263 %
1 | 56688 | 0.05987 %
2 | 5018 | 0.00529 %
3 | 1359 | 0.00144 %
4 | 111 | 0.00012 %
5 | 25 | 0.00003 %

There were also a few operations which needed more than five retries, so being prepared to try up to ten times is not a bad idea.

The cluster runs 0.6.5 with RF=3. Each operation is executed until it succeeds or until 10 retries have been made, using this PHP wrapper: http://github.com/dynamoid/cassandra-utilities

Have others found similar results? Please discuss :)

- Juho Mäkinen
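For anyone who wants to check the percentage column, here is a minimal Python sketch that recomputes each row's share from the counts and the total given in the post. The function names are made up for illustration; only the numbers come from the post.

```python
# Counts and total copied from the post; the names below are illustrative only.
TOTAL_OPERATIONS = 94_682_251
RETRY_COUNTS = {0: 94_618_468, 1: 56_688, 2: 5_018, 3: 1_359, 4: 111, 5: 25}


def share_percent(count, total=TOTAL_OPERATIONS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


def cumulative_percent(max_retries):
    """Percentage of all operations that succeeded within max_retries retries."""
    done = sum(n for retries, n in RETRY_COUNTS.items() if retries <= max_retries)
    return share_percent(done)


if __name__ == "__main__":
    # Columns: retries | operations | share of total | cumulative share
    for retries, count in sorted(RETRY_COUNTS.items()):
        print(f"{retries} | {count} | {share_percent(count):.5f} % | "
              f"{cumulative_percent(retries):.5f} %")
```

The cumulative share after five retries comes out just under 100 %, which is consistent with the remark that a few operations needed more than five retries.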
Re: Cassandra operation success ratio survey results
On 21-09-2010 15:29, Juho Mäkinen wrote:
> It's known that compaction hurts the node performance so that it might miss some requests. That's why it's important to handle these situations and the client needs to retry the operation into another working host. We have been storing performance data from each cassandra request which we do into our five node cassandra production cluster.

This is not the main topic of this mail, but how does a client detect the performance issue here? I guess that somehow the client does not get an answer.

./Morten
Re: Cassandra operation success ratio survey results
The standard Thrift PHP client detects the problem through a normal timeout, which triggers a TException (Thrift exception) indicating that the request timed out, or the (in)famous "timed out reading 4 bytes from host" error. These errors are caught in my PHP wrapper [http://github.com/dynamoid/cassandra-utilities], which retries the operation against another server inside a while loop.

The timeout setting is provided by the user. We are currently using one second, but we'll probably make it shorter in the future and increase the timeout when the request has already timed out on the first server, and also vary it by request type (get_slice takes longer than get_column).

- Juho Mäkinen

On Tue, Sep 21, 2010 at 5:56 PM, Morten Wegelbye Nissen m...@monit.dk wrote:
> On 21-09-2010 15:29, Juho Mäkinen wrote:
>> It's known that compaction hurts the node performance so that it might miss some requests. That's why it's important to handle these situations and the client needs to retry the operation into another working host. We have been storing performance data from each cassandra request which we do into our five node cassandra production cluster.
>
> This is not the main topic of this mail, but how does a client detect the performance issue here? I guess that somehow the client does not get an answer.
>
> ./Morten
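The retry behaviour described above can be sketched roughly as follows. This is not the code of the PHP wrapper; it is a minimal Python illustration under stated assumptions: `RequestTimeout` stands in for the Thrift timeout exception, `call_with_retries` and its parameters are hypothetical names, and the timeout growth per retry is an arbitrary choice.

```python
import random


class RequestTimeout(Exception):
    """Hypothetical stand-in for the client library's timeout error."""


def call_with_retries(operation, hosts, max_retries=10,
                      base_timeout=1.0, timeout_step=0.5):
    """Call operation(host, timeout) until it succeeds or retries run out.

    Returns (result, retries_used). Each retry moves on to the next host in
    the list and uses a longer timeout than the previous attempt.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    start = random.randrange(len(hosts))  # spread first attempts across hosts
    last_error = None
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        host = hosts[(start + attempt) % len(hosts)]
        timeout = base_timeout + attempt * timeout_step
        try:
            return operation(host, timeout), attempt
        except RequestTimeout as error:
            last_error = error  # remember the failure and try the next host
    raise last_error


if __name__ == "__main__":
    attempts = []

    def flaky_read(host, timeout):
        # Simulated request: the first two attempts time out.
        attempts.append((host, timeout))
        if len(attempts) < 3:
            raise RequestTimeout(f"timed out on {host}")
        return "row data"

    print(call_with_retries(flaky_read, ["node1", "node2", "node3"]))
```

Returning the retry count alongside the result makes it easy to log it per request, which is the kind of data the survey is based on.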
Re: Cassandra operation success ratio survey results
On Tue, Sep 21, 2010 at 8:29 AM, Juho Mäkinen juho.maki...@gmail.com wrote:
> It's known that compaction hurts the node performance so that it might miss some requests. That's why it's important to handle these situations and the client needs to retry the operation into another working host. We have been storing performance data from each cassandra request which we do into our five node cassandra production cluster.

You should enable the dynamic snitch, since it's designed to route around nodes that are not performing well, due to compaction or other problems: https://issues.apache.org/jira/browse/CASSANDRA-981

Also, reducing the compaction thread's priority is a good idea: https://issues.apache.org/jira/browse/CASSANDRA-1181

These are on by default in 0.7, but require additional JVM args in 0.6.

-Brandon
Re: Cassandra operation success ratio survey results
Thanks for this, really interesting stuff.

Just to make sure I'm understanding it: this is for PHP clients with a 1 second timeout, and the retry goes to a different node in the cluster with the same timeout.

Have you enabled the Dynamic Snitch? http://www.riptano.com/blog/whats-new-cassandra-065

Aaron

On 22 Sep, 2010, at 01:29 AM, Juho Mäkinen juho.maki...@gmail.com wrote:
> It's known that compaction hurts node performance enough that the node may miss some requests. That's why it's important for the client to handle these situations by retrying the operation against another working host.
>
> We have been storing performance data for every Cassandra request we make against our five-node production cluster. We log the retry count and request type into our data warehouse, and I've now extracted the data from a 10-day period and calculated how many retries were needed before each operation returned a result.
>
> The following table shows how many times an operation had to be retried before it completed successfully. Each percentage is the share of all operations; for example, 99.933 % of requests succeeded on the first try.
>
> Total number of operations: 94 682 251 within 10 days.
>
> Retry times | operations | percentage of total operations
> 0 | 94618468 | 99.93263 %
> 1 | 56688 | 0.05987 %
> 2 | 5018 | 0.00529 %
> 3 | 1359 | 0.00144 %
> 4 | 111 | 0.00012 %
> 5 | 25 | 0.00003 %
>
> There were also a few operations which needed more than five retries, so being prepared to try up to ten times is not a bad idea.
>
> The cluster runs 0.6.5 with RF=3. Each operation is executed until it succeeds or until 10 retries have been made, using this PHP wrapper: http://github.com/dynamoid/cassandra-utilities
>
> Have others found similar results? Please discuss :)
>
> - Juho Mäkinen