We issue writes to Cassandra as logged batches (RF=3, consistency levels
TWO, QUORUM, or LOCAL_QUORUM).
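For reference, this is roughly how our writers issue those batches. This is a minimal sketch assuming the DataStax Java driver 4.x; the keyspace, table, and column names are placeholders, not our actual schema:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.BatchStatement;
    import com.datastax.oss.driver.api.core.cql.BatchType;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class LoggedBatchExample {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // A logged batch: both inserts go through the batchlog first.
                BatchStatement batch = BatchStatement.builder(BatchType.LOGGED)
                    .addStatement(SimpleStatement.newInstance(
                        "INSERT INTO ks.events (id, payload) VALUES (?, ?)", 1, "a"))
                    .addStatement(SimpleStatement.newInstance(
                        "INSERT INTO ks.events_by_day (day, id) VALUES (?, ?)", "2024-01-01", 1))
                    // We run at TWO, QUORUM, or LOCAL_QUORUM; LOCAL_QUORUM shown here.
                    .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM)
                    .build();
                session.execute(batch);
            }
        }
    }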
On clusters of any size, a single extremely slow node causes a roughly
90% loss of cluster-wide throughput when using batched writes. We can
reproduce this in the lab via CPU or disk throttling, and we observe it
in 3.11, 4.0, and 4.1.
It appears the mechanism in play is:
Those logged batches are immediately written to two replica nodes, and
the actual mutations aren't processed until those two nodes acknowledge
the batch statements. The batchlog replica nodes are selected randomly
from all nodes in the local data center that are currently up in gossip.
If a single node is slow, but still considered up in gossip, this
eventually causes every other node to have all of its MutationStage
threads waiting while the slow replica accepts batchlog writes.
The code in play appears to be
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245.
In the method filterBatchlogEndpoints() there is a
Collections.shuffle() to order the endpoints and a
FailureDetector.isEndpointAlive() check to decide whether an endpoint is
acceptable.
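To make the failure mode concrete, here is a condensed sketch of the selection behavior described above. It is not the verbatim code from ReplicaPlans.filterBatchlogEndpoints(); it only illustrates the two properties that matter here: the candidate order is a random shuffle, and the only health check is liveness in gossip:

    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.function.Predicate;

    final class BatchlogEndpointSketch {
        // Pick two batchlog endpoints from the local DC, the way the report describes.
        static List<InetAddress> pickBatchlogEndpoints(List<InetAddress> localDcNodes,
                                                       InetAddress self,
                                                       Predicate<InetAddress> isAlive) {
            List<InetAddress> candidates = new ArrayList<>(localDcNodes);
            candidates.remove(self);                 // never pick ourselves
            Collections.shuffle(candidates);         // random order, no latency awareness
            List<InetAddress> chosen = new ArrayList<>(2);
            for (InetAddress node : candidates) {
                if (isAlive.test(node)) {            // "up in gossip" is the only filter
                    chosen.add(node);
                    if (chosen.size() == 2) break;   // batchlog goes to two replicas
                }
            }
            return chosen;
        }
    }

Because a slow-but-alive node passes the isAlive check, it keeps being drawn by the shuffle on every coordinator, and each draw ties up that coordinator's MutationStage until the slow node acknowledges the batchlog write.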
This behavior causes Cassandra to move from a multi-node fault-tolerant
system to a collection of single points of failure.
We try to take administrative action to kill off extremely slow nodes,
but it would be great to have some notion of "which node is a bad
choice" when selecting batchlog replicas for logged batches.