Jeffrey F. Lukman created CASSANDRA-11724:
---------------------------------------------

             Summary: False Failure Detection in Big Cassandra Cluster
                 Key: CASSANDRA-11724
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Jeffrey F. Lukman
             Fix For: 2.2.5
         Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, Workload4.jpg

We are testing the stable release of Cassandra v2.2.5 on a large cluster. Each 
machine runs 8 Cassandra instances, and we test cluster sizes of 32, 64, 128, 
256, and 512 instances. Each instance uses the default number of vnodes, which 
is 256. The data and log directories are on an in-memory tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it reaches a stable 
condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it reaches a stable 
condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it reaches a stable condition, and 
decommission one node
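
For a sense of scale, the setup described above (8 instances per machine, 256 
vnodes per instance) can be turned into machine and token counts. This is only 
a small arithmetic sketch; the helper name cluster_scale is made up for 
illustration.

```python
# Values taken from the test setup described in this report.
INSTANCES_PER_MACHINE = 8
VNODES_PER_INSTANCE = 256
CLUSTER_SIZES = [32, 64, 128, 256, 512]


def cluster_scale(instances):
    """Return (machines, total vnode tokens) for a given instance count."""
    machines = instances // INSTANCES_PER_MACHINE
    tokens = instances * VNODES_PER_INSTANCE
    return machines, tokens


for n in CLUSTER_SIZES:
    machines, tokens = cluster_scale(n)
    print(f"{n:>3} instances -> {machines:>2} machines, {tokens:>6} vnode tokens")
```

At the largest size this is 64 machines and 131072 tokens in the ring.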

For this testing, we measure the total number of false failure detections 
inside the cluster. By false failure detection we mean that, for example, 
instance-1 marks instance-2 as down although instance-2 is not down. We dug 
deeper into the root cause and found that instance-1 had not received any 
heartbeat from instance-2 for some time because instance-2 was running a long 
computation.
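
The mechanism described above can be illustrated with a toy accrual-style 
detector. This is a minimal sketch, not Cassandra's actual failure detector 
code: the class name SimplePhiDetector, the threshold of 8, and the one-second 
heartbeat interval are all assumptions chosen for illustration. It shows how a 
peer that stops sending heartbeats while busy with a long computation ends up 
marked down even though it is still alive.

```python
import math
from collections import deque


class SimplePhiDetector:
    """Toy accrual-style failure detector (illustrative only)."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold              # assumed conviction threshold
        self.intervals = deque(maxlen=window)   # recent heartbeat gaps, in seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: silence relative to the mean gap, on a log10 scale."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        return (now - self.last_heartbeat) / mean / math.log(10)

    def is_down(self, now):
        return self.phi(now) > self.threshold


detector = SimplePhiDetector()
for t in range(10):
    detector.heartbeat(float(t))   # instance-2 sends heartbeats at t = 0..9

# instance-2 then stalls in a long computation and sends nothing more.
print(detector.is_down(12.0))   # False: 3 s of silence is tolerated
print(detector.is_down(40.0))   # True: 31 s of silence, peer is marked down
```

With one-second heartbeats, this sketch tolerates about 18 seconds of silence 
(8 * ln 10) before marking the peer down, so any longer pause on a live 
instance is reported as a failure.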

I have attached graphs of the results for each workload.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
