[
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277100#comment-15277100
]
Jeffrey F. Lukman commented on CASSANDRA-11724:
-----------------------------------------------
[~jeromatron]: okay, I will try this again and report the result later, to see
whether this config causes a different result.
For now, can you help me by confirming whether you also see the Workload-4 bug?
Workload-4: run a 512-node cluster with some data, then decommission a node.
On our side, we see a high number of false failure detections.
> False Failure Detection in Big Cassandra Cluster
> ------------------------------------------------
>
> Key: CASSANDRA-11724
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Jeffrey F. Lukman
> Labels: gossip, node-failure
> Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg,
> Workload4.jpg, experiment-result.txt
>
>
> We are running some testing on Cassandra v2.2.5 stable in a big cluster. In
> our test setup, each machine has 16 cores and runs 8 Cassandra instances, and
> we test with 32, 64, 128, 256, and 512 Cassandra instances in total. We use
> the default number of vnodes per instance, which is 256. The data and log
> directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it reaches a stable state,
> then start the other half
> Workload3: Start half of the cluster, wait until it reaches a stable state,
> load some data, then start the other half
> Workload4: Start the cluster, wait until it reaches a stable state, load
> some data, then decommission one node
> For this testing, we measure the total number of false failure detections
> inside the cluster. By false failure detection we mean that, for example,
> instance-1 marks instance-2 as down, even though instance-2 is not down. We
> dug deeper into the root cause and found that instance-1 had not received a
> heartbeat from instance-2 for some time, because instance-2 was running a
> long computation.
> Here I attach the graphs of each workload result.
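The root-cause description above (a delayed heartbeat inflating a peer's suspicion level until it is marked down) can be sketched as a minimal phi-accrual-style detector. This is only an illustration of the mechanism, not Cassandra's actual FailureDetector code; the class and method names are invented, and the detector is simplified to use the mean heartbeat interval with an exponential model.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a phi-accrual-style failure detector, similar in
// spirit to (but much simpler than) Cassandra's gossip FailureDetector.
// All names here are illustrative, not Cassandra's real API.
public class PhiAccrualSketch {
    private static final int WINDOW = 1000;          // heartbeat samples kept
    private final Deque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    // Record an incoming heartbeat from the monitored peer.
    public void report(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // phi = -log10(P(no heartbeat for this long)). With an exponential
    // model of inter-arrival times and mean interval m, this reduces to
    // elapsed / (m * ln 10).
    public double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue)
                               .average().orElse(1.0);
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (mean * Math.log(10.0));
    }

    // The peer is convicted (marked down) once phi exceeds a threshold
    // (cf. phi_convict_threshold in cassandra.yaml, default 8). A long
    // computation or GC pause on the peer delays its heartbeats, which
    // inflates phi here and can cause exactly the false conviction
    // described in this issue.
    public boolean isDown(long nowMillis, double threshold) {
        return phi(nowMillis) > threshold;
    }
}
```

Under this model, a peer that normally gossips every second but then stalls for tens of seconds will push phi well past the conviction threshold, even though the process is alive.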
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)