Dan Sarisky created CASSANDRA-18120:
---------------------------------------
Summary: Single slow node dramatically reduces cluster write
throughput regardless of CL
Key: CASSANDRA-18120
URL: https://issues.apache.org/jira/browse/CASSANDRA-18120
Project: Cassandra
Issue Type: Improvement
Reporter: Dan Sarisky
We issue writes to Cassandra as logged batches (RF=3, consistency levels TWO,
QUORUM, or LOCAL_QUORUM).
On clusters of any size, a single extremely slow node causes a ~90% loss of
cluster-wide throughput for batched writes. We can reproduce this in the lab
via CPU or disk throttling, and we observe it in 3.11, 4.0, and 4.1.
It appears the mechanism in play is:
Those logged batches are immediately written to two batchlog replica nodes, and
the actual mutations are not processed until those two nodes acknowledge the
batch. The batchlog replica nodes are selected randomly from all nodes in the
local data center that are currently up in gossip. If a single node is slow, but
still considered up in gossip, this eventually causes every other node to have
all of its MutationStage threads blocked waiting while the slow replica accepts
batchlog writes.
The code in play appears to be
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245].
In filterBatchlogEndpoints() there is a Collections.shuffle() to order the
endpoints and a FailureDetector.isEndpointAlive() check to test whether an
endpoint is acceptable.
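For illustration, here is a minimal, self-contained sketch of that selection
behavior as described above. It is not the trunk source: endpoints are plain
Strings and the isAlive predicate stands in for FailureDetector.isEndpointAlive.
The point is that liveness is the only criterion, so a node that is up in gossip
but extremely slow is still eligible.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Simplified sketch of batchlog endpoint selection (assumed names, not trunk code).
public class BatchlogEndpointSelectionSketch
{
    static List<String> pickBatchlogEndpoints(List<String> localDcEndpoints,
                                              Predicate<String> isAlive)
    {
        List<String> candidates = new ArrayList<>(localDcEndpoints);
        Collections.shuffle(candidates);          // random ordering of the local DC
        List<String> chosen = new ArrayList<>(2);
        for (String endpoint : candidates)
        {
            if (isAlive.test(endpoint))           // liveness is the only filter
                chosen.add(endpoint);
            if (chosen.size() == 2)               // the batchlog goes to two replicas
                break;
        }
        return chosen;
    }

    public static void main(String[] args)
    {
        List<String> dc = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4");
        // "10.0.0.3" may be extremely slow, but as long as it is alive in gossip
        // it remains a valid choice.
        System.out.println(pickBatchlogEndpoints(dc, e -> true));
    }
}
{code}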
This behavior turns Cassandra from a multi-node fault-tolerant system into a
collection of single points of failure.
We try to take administrative action to kill off extremely slow nodes, but it
would be great to have some notion of "what node is a bad choice" when selecting
the batchlog replica nodes for logged batches.
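As a purely illustrative sketch of what such a filter might look like (the score
source, the 3x threshold, and all names here are assumptions, not a concrete
proposal): skip nodes whose recent latency score is far worse than the best node
in the local DC, and fall back to the plain liveness filter if that leaves fewer
than two candidates.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical latency-aware variant of the selection above (illustration only).
public class LatencyAwareBatchlogSelectionSketch
{
    static List<String> pickBatchlogEndpoints(List<String> localDcEndpoints,
                                              Predicate<String> isAlive,
                                              Map<String, Double> latencyScores)
    {
        double best = localDcEndpoints.stream()
                                      .mapToDouble(e -> latencyScores.getOrDefault(e, 1.0))
                                      .min()
                                      .orElse(1.0);

        List<String> candidates = new ArrayList<>(localDcEndpoints);
        Collections.shuffle(candidates);
        List<String> chosen = new ArrayList<>(2);
        for (String endpoint : candidates)
        {
            double score = latencyScores.getOrDefault(endpoint, 1.0);
            // Assumed threshold: treat anything more than 3x slower than the
            // best-scoring node as a bad choice for the batchlog.
            if (isAlive.test(endpoint) && score <= best * 3.0)
                chosen.add(endpoint);
            if (chosen.size() == 2)
                break;
        }
        // Fall back to liveness-only selection so availability is not reduced.
        if (chosen.size() < 2)
        {
            for (String endpoint : candidates)
            {
                if (isAlive.test(endpoint) && !chosen.contains(endpoint))
                    chosen.add(endpoint);
                if (chosen.size() == 2)
                    break;
            }
        }
        return chosen;
    }

    public static void main(String[] args)
    {
        List<String> dc = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4");
        Map<String, Double> scores = Map.of("10.0.0.1", 1.0, "10.0.0.2", 1.2,
                                            "10.0.0.3", 40.0, "10.0.0.4", 1.1);
        // "10.0.0.3" is alive but pathologically slow, so it is filtered out here.
        System.out.println(pickBatchlogEndpoints(dc, e -> true, scores));
    }
}
{code}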