[ https://issues.apache.org/jira/browse/CASSANDRA-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Semb Wever updated CASSANDRA-18120: ------------------------------------------- Change Category: Operability Complexity: Normal Component/s: Consistency/Coordination Fix Version/s: 4.0.x 4.1.x 5.0.x 5.x Reviewers: Michael Semb Wever Status: Open (was: Triage Needed) > Single slow node dramatically reduces cluster logged batch write throughput > regardless of CL > -------------------------------------------------------------------------------------------- > > Key: CASSANDRA-18120 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18120 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination > Reporter: Dan Sarisky > Assignee: Maxim Chanturiay > Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > > We issue writes to Cassandra as logged batches(RF=3, Consistency levels=TWO, > QUORUM, or LOCAL_QUORUM) > > On clusters of any size - a single extremely slow node causes a ~90% loss of > cluster-wide throughput using batched writes. We can replicate this in the > lab via CPU or disk throttling. I observe this in 3.11, 4.0, and 4.1. > > It appears the mechanism in play is: > Those logged batches are immediately written to two replica nodes and the > actual mutations aren't processed until those two nodes acknowledge the batch > statements. Those replica nodes are selected randomly from all nodes in the > local data center currently up in gossip. If a single node is slow, but > still thought to be up in gossip, this eventually causes every other node to > have all of its MutationStages to be waiting while the slow replica accepts > batch writes. > > The code in play appears to be: > See > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245]. > In the method filterBatchlogEndpoints() there is a > Collections.shuffle() to order the endpoints and a > FailureDetector.isEndpointAlive() to test if the endpoint is acceptable. > > This behavior causes Cassandra to move from a multi-node fault tolerant > system toa collection of single points of failure. > > We try to take administrator actions to kill off the extremely slow nodes, > but it would be great to have some notion of "what node is a bad choice" when > writing log batches to replica nodes. > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org