[
https://issues.apache.org/jira/browse/CASSANDRA-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712942#comment-17712942
]
Michael Semb Wever commented on CASSANDRA-18120:
------------------------------------------------
bq. I am starting to work on the patch that uses function `sortedByProximity()`
instead of the Collection's shuffle.
[~maximc], do we see a similar behaviour with >ONE reads when the dynamic snitch is
disabled? `sortedByProximity()` is different from the dynamic snitch's
implementation of it, and many users with correctly set up clusters on
dedicated modern hardware disable the dynamic snitch because it only gets in
the way and reduces optimal performance. (That's not to say that telling them to
re-enable it if they hit this problem would be valid advice.)
(ref: [slack
discussion|https://the-asf.slack.com/archives/CK23JSY2K/p1681397339884839])
bq. But if you want to code it - I'm happy to run a before/after build to
provide some data on whether it solves the customer level problem.
[~dsarisky], yes please. We are all very swamped atm, and really would
appreciate the help.
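For reference, a minimal standalone sketch of the direction being discussed: ordering
the live batchlog candidates by a recent-latency score and taking the first two,
instead of shuffling. Illustrative only; the class, the latency map, and
pickByProximity() are simplified stand-ins, not the actual ReplicaPlans or
`sortedByProximity()` code.

{code:java}
// Illustrative sketch only; not the actual ReplicaPlans patch. The latency map and
// pickByProximity() are simplified stand-ins for Cassandra's snitch/proximity machinery.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ProximityPick
{
    /** Prefer the live endpoints with the lowest recent latency instead of a random pair. */
    static List<String> pickByProximity(List<String> candidates,
                                        Predicate<String> isAlive,
                                        Map<String, Double> recentLatencyMillis)
    {
        List<String> live = new ArrayList<>();
        for (String e : candidates)
            if (isAlive.test(e))
                live.add(e);
        // Endpoints with no recorded latency sort last rather than being excluded.
        live.sort(Comparator.comparingDouble((String e) -> recentLatencyMillis.getOrDefault(e, Double.MAX_VALUE)));
        return live.subList(0, Math.min(2, live.size()));
    }

    public static void main(String[] args)
    {
        Map<String, Double> latency = Map.of("10.0.0.1", 2.0, "10.0.0.2", 450.0, "10.0.0.3", 3.0);
        // Prints the two fastest endpoints; the slow 10.0.0.2 is only used as a last resort.
        System.out.println(pickByProximity(List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"), e -> true, latency));
    }
}
{code}

Whether that ordering signal should come from `sortedByProximity()`, the dynamic
snitch, or something else is exactly the question above.
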
> Single slow node dramatically reduces cluster write throughput regardless of
> CL
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-18120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18120
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Dan Sarisky
> Assignee: Maxim Chanturiay
> Priority: Normal
>
> We issue writes to Cassandra as logged batches (RF=3, consistency levels = TWO,
> QUORUM, or LOCAL_QUORUM).
>
> On clusters of any size, a single extremely slow node causes a ~90% loss of
> cluster-wide throughput for batched writes. We can replicate this in the lab via
> CPU or disk throttling. I observe this in 3.11, 4.0, and 4.1.
>
> It appears the mechanism in play is:
> Those logged batches are immediately written to two replica nodes, and the
> actual mutations aren't processed until those two nodes acknowledge the batch
> statements. Those replica nodes are selected randomly from all nodes in the
> local data center that are currently up in gossip. If a single node is slow, but
> still thought to be up in gossip, this eventually causes every other node to
> have all of its MutationStage threads waiting while the slow replica accepts
> batch writes.
>
> The code in play appears to be:
> See
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245].
> In the method filterBatchlogEndpoints() there is a
> Collections.shuffle() to order the endpoints and a
> FailureDetector.isEndpointAlive() check to test whether an endpoint is acceptable.
>
> This behavior causes Cassandra to move from a multi-node, fault-tolerant
> system to a collection of single points of failure.
>
> We try to take administrative action to kill off the extremely slow nodes,
> but it would be great to have some notion of "which node is a bad choice" when
> writing logged batches to replica nodes.
>
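To make the quoted description concrete, here is a minimal standalone sketch of the
selection pattern it refers to: endpoints the failure detector still reports as alive
are shuffled and the first two become the batchlog replicas. This is an illustration
only, with hypothetical names, not the actual filterBatchlogEndpoints() code.

{code:java}
// Simplified illustration of the behaviour described above: candidates the failure
// detector still reports as alive are shuffled and the first two become the batchlog
// replicas. A node can be extremely slow and still pass the alive check.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

public class ShufflePick
{
    static List<String> pickBatchlogEndpoints(List<String> localDcEndpoints, Predicate<String> isEndpointAlive)
    {
        List<String> live = new ArrayList<>();
        for (String e : localDcEndpoints)
            if (isEndpointAlive.test(e))   // slow-but-up nodes pass this check
                live.add(e);
        Collections.shuffle(live);         // random order; no notion of node health beyond up/down
        return live.subList(0, Math.min(2, live.size()));
    }

    public static void main(String[] args)
    {
        List<String> dc = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4");
        // With four live candidates, each node (including a throttled one) lands in the
        // chosen pair for roughly half of all batches, putting it on many batches' critical path.
        System.out.println(pickBatchlogEndpoints(dc, e -> true));
    }
}
{code}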