We issue writes to Cassandra as logged batches (RF=3, consistency levels
TWO, QUORUM, or LOCAL_QUORUM).
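For reference, this is roughly how our writers issue those batches. This is a minimal sketch assuming the DataStax Java driver 4.x; the keyspace, table, and column names are placeholders, not our actual schema:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.BatchStatement;
    import com.datastax.oss.driver.api.core.cql.BatchType;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class LoggedBatchExample {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // A logged batch: both inserts go through the batchlog first.
                BatchStatement batch = BatchStatement.builder(BatchType.LOGGED)
                    .addStatement(SimpleStatement.newInstance(
                        "INSERT INTO ks.events (id, payload) VALUES (?, ?)", 1, "a"))
                    .addStatement(SimpleStatement.newInstance(
                        "INSERT INTO ks.events_by_day (day, id) VALUES (?, ?)", "2024-01-01", 1))
                    // We run at TWO, QUORUM, or LOCAL_QUORUM; LOCAL_QUORUM shown here.
                    .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM)
                    .build();
                session.execute(batch);
            }
        }
    }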
On clusters of any size, a single extremely slow node causes a roughly
90% loss of cluster-wide throughput when using batched writes. We can
reproduce this in the lab via CPU or disk throttling, and we observe it
in 3.11, 4.0, and 4.1.
It appears the mechanism in play is:
Those logged batches are immediately written to two replica nodes, and
the actual mutations aren't processed until those two nodes acknowledge
the batch statements. The batchlog replica nodes are selected randomly
from all nodes in the local data center that are currently up in gossip.
If a single node is slow, but still considered up in gossip, this
eventually causes every other node to have all of its MutationStage
threads waiting while the slow replica accepts batchlog writes.
The code in play appears to be
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245.
In the method filterBatchlogEndpoints() there is a
Collections.shuffle() to order the endpoints and a
FailureDetector.isEndpointAlive() check to decide whether an endpoint is
acceptable.
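To make the failure mode concrete, here is a condensed sketch of the selection behavior described above. It is not the verbatim code from ReplicaPlans.filterBatchlogEndpoints(); it only illustrates the two properties that matter here: the candidate order is a random shuffle, and the only health check is liveness in gossip:

    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.function.Predicate;

    final class BatchlogEndpointSketch {
        // Pick two batchlog endpoints from the local DC, the way the report describes.
        static List<InetAddress> pickBatchlogEndpoints(List<InetAddress> localDcNodes,
                                                       InetAddress self,
                                                       Predicate<InetAddress> isAlive) {
            List<InetAddress> candidates = new ArrayList<>(localDcNodes);
            candidates.remove(self);                 // never pick ourselves
            Collections.shuffle(candidates);         // random order, no latency awareness
            List<InetAddress> chosen = new ArrayList<>(2);
            for (InetAddress node : candidates) {
                if (isAlive.test(node)) {            // "up in gossip" is the only filter
                    chosen.add(node);
                    if (chosen.size() == 2) break;   // batchlog goes to two replicas
                }
            }
            return chosen;
        }
    }

Because a slow-but-alive node passes the isAlive check, it keeps being drawn by the shuffle on every coordinator, and each draw ties up that coordinator's MutationStage until the slow node acknowledges the batchlog write.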
This behavior causes Cassandra to move from a multi-node fault-tolerant
system to a collection of single points of failure.
We try to take administrative action to kill off extremely slow nodes,
but it would be great to have some notion of "which node is a bad
choice" when selecting batchlog replicas for logged batches.