[ https://issues.apache.org/jira/browse/CASSANDRA-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732243#comment-17732243 ]

Maxim Chanturiay commented on CASSANDRA-18120:
----------------------------------------------

Hello everyone! Thank you for the replies and guidelines. They have taken me on 
an interesting journey through the code.

It's been a while on my part; however, I do have something concrete to review: 
[https://github.com/apache/cassandra/pull/2415].
This is by no means the final version, or even the final approach, but it 
illustrates the solution and, more importantly, can be compiled to run tests in 
a more sophisticated environment than my local 3 containers.
Unfortunately, I couldn't set up a datacenter with more than 3 nodes running at 
the same time: any additional container causes the running Cassandra process on 
the previous containers to shut down, and I've failed to figure out the reason :(
However, as far as the unit test goes (`BatchlogEndpointFilterByResponseTimeTest`), 
the change seems to reliably pick faster nodes, given that PHI values exist.
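
To make the idea concrete, here is a minimal standalone sketch (plain Java, not 
the actual patch; `Endpoint`, `phiByEndpoint` and `pickBatchlogEndpoints` are 
made-up names, and rack awareness is omitted) of ordering the alive candidate 
endpoints by the failure detector's PHI value and taking the most responsive ones:

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Standalone sketch of the idea behind the change: instead of shuffling the
// candidate batchlog endpoints randomly, order them by the failure detector's
// PHI value (lower PHI ~ heard from more recently / more responsive) and take
// the two best. These are stand-in types, not Cassandra classes.
public class PhiBasedBatchlogFilterSketch
{
    record Endpoint(String address) {}

    static List<Endpoint> pickBatchlogEndpoints(List<Endpoint> aliveCandidates,
                                                Map<Endpoint, Double> phiByEndpoint)
    {
        List<Endpoint> sorted = new ArrayList<>(aliveCandidates);
        // Endpoints without a PHI value yet are sorted last here; the real patch
        // may treat the "no PHI yet" case differently.
        sorted.sort(Comparator.comparingDouble(e -> phiByEndpoint.getOrDefault(e, Double.MAX_VALUE)));
        return sorted.subList(0, Math.min(2, sorted.size()));
    }

    public static void main(String[] args)
    {
        Endpoint a = new Endpoint("10.0.0.1");
        Endpoint b = new Endpoint("10.0.0.2"); // the CPU-overloaded node, high PHI
        Endpoint c = new Endpoint("10.0.0.3");
        System.out.println(pickBatchlogEndpoints(List.of(a, b, c),
                                                 Map.of(a, 0.1, b, 8.5, c, 0.3)));
        // -> [Endpoint[address=10.0.0.1], Endpoint[address=10.0.0.3]]
    }
}
{code}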

[~dsarisky], is there a chance you could compile a branch with the fix in your 
environment and test it out?
Here is the link to the branch with the fix: 
[https://github.com/maxim-chn/cassandra/tree/18120-4.1].
There are 2 important use cases (covered by the new unit test 
`BatchlogEndpointFilterByResponseTimeTest`):
 - A local rack plus a non-local rack, where the non-local rack has more than 2 
potential replica nodes. Alternatively, a single rack with more than 3 nodes.
 - A local rack plus more than 2 non-local racks, where each non-local rack has 
at least 2 potential replica nodes.

Once a replica node (or several) becomes CPU overloaded, for example, my 
estimate is that it should take less than 30 seconds to accumulate enough data 
into the PHI values for the slower nodes to stop being picked at all.

+Why not go with Dynamic Snitch?+
I support [~mck]'s comment, especially the point that we would like to avoid 
any change that requires an action on the part of the end users.
They would probably keep operating as usual and ignore such a change altogether 
:D.

+Why not go with `sortedByProximity()`?+
First reason: it turns out the method is not implemented in even a similar way 
across the descendants of `AbstractEndpointSnitch`. In fact, 
`DynamicEndpointSnitch` throws an error when the method is called and urges 
callers to use its alternatives. In short, the complexity of making sure that 
every `AbstractEndpointSnitch` descendant implements the method properly, and 
that the current behavior is not "broken", is too high. A bug would surely 
result here. Maybe a couple of bugs :).

Second reason: the method's signature states that `<C extends 
ReplicaCollection<? extends C>> C` is the material for sorting.
This is easily obtainable when we perform a write operation on a single 
keyspace through 
`keyspace.getReplicationStrategy().getNaturalReplicasForToken()`.
However, for batch operations, which potentially involve different keyspaces, 
this is problematic.
Trying to avoid `getNaturalReplicasForToken()` results in creating `Replica` 
objects manually from a `List<InetAddressAndPort>`, or something along those 
lines, and the `Replica` class has a pretty convincing comment urging that such 
scenarios be avoided by all means.
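
To illustrate the mismatch, here is a rough standalone sketch with mocked-up 
types (`ReplicaCollection`, `Replica` and `sortedByProximity` below only mirror 
the shape of the real API; they are not Cassandra's classes and carry none of 
the real semantics such as token ranges or full vs. transient replicas): the 
method can only sort a self-referential `ReplicaCollection`, while the batchlog 
path only has bare addresses, so calling it would mean fabricating 
`Replica`-like wrappers on the spot:

{code:java}
import java.util.List;

public class SortedByProximityMismatchSketch
{
    // Shape-only stand-ins for the real types.
    interface ReplicaCollection<C extends ReplicaCollection<C>> { }

    record Replica(String endpoint /* the real class also needs a token range and a full/transient flag */) { }

    record EndpointsForToken(List<Replica> replicas) implements ReplicaCollection<EndpointsForToken> { }

    // Shape of the snitch method: it can only sort a ReplicaCollection.
    static <C extends ReplicaCollection<C>> C sortedByProximity(String localAddress, C unsorted)
    {
        return unsorted; // sorting elided; only the signature matters here
    }

    public static void main(String[] args)
    {
        // The batchlog endpoint filter only has plain addresses to work with:
        List<String> candidateAddresses = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3");

        // To call sortedByProximity() we would have to fabricate Replica objects,
        // inventing the token-range/full data we do not actually have, which is
        // exactly what the Replica class comment warns against.
        EndpointsForToken fabricated =
            new EndpointsForToken(candidateAddresses.stream().map(Replica::new).toList());

        System.out.println(sortedByProximity("10.0.0.1", fabricated));
    }
}
{code}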

> Single slow node dramatically reduces cluster write throughput regardless of 
> CL
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18120
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18120
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Dan Sarisky
>            Assignee: Maxim Chanturiay
>            Priority: Normal
>
> We issue writes to Cassandra as logged batches(RF=3, Consistency levels=TWO, 
> QUORUM, or LOCAL_QUORUM)
>  
> On clusters of any size - a single extremely slow node causes a ~90% loss of 
> cluster-wide throughput using batched writes.  We can replicate this in the 
> lab via CPU or disk throttling.  I observe this in 3.11, 4.0, and 4.1.
>  
> It appears the mechanism in play is:
> Those logged batches are immediately written to two replica nodes and the 
> actual mutations aren't processed until those two nodes acknowledge the batch 
> statements.  Those replica nodes are selected randomly from all nodes in the 
> local data center currently up in gossip.  If a single node is slow, but 
> still thought to be up in gossip, this eventually causes every other node to 
> have all of its MutationStages to be waiting while the slow replica accepts 
> batch writes.
>  
> The code in play appears to be:
> See
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245].
>   In the method filterBatchlogEndpoints() there is a
> Collections.shuffle() to order the endpoints and a
> FailureDetector.isEndpointAlive() to test if the endpoint is acceptable.
>  
> This behavior causes Cassandra to move from a multi-node fault tolerant 
> system to a collection of single points of failure.
>  
> We try to take administrator actions to kill off the extremely slow nodes, 
> but it would be great to have some notion of "what node is a bad choice" when 
> writing log batches to replica nodes.
>  


