[
https://issues.apache.org/jira/browse/CASSANDRA-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823365#comment-17823365
]
Shayne Hunsaker commented on CASSANDRA-18120:
---------------------------------------------
[~dsarisky] [~maximc] [~mck] I'm reviving this for more discussion.
We're working on a fix for this in our fork and would ideally like to
contribute the solution upstream as well, to keep the two codebases in sync.
We're looking into porting the [DataStax Enterprise
batchlog_endpoint_strategy|https://docs.datastax.com/en/dse/6.8/docs/managing/configure/configure-cassandra-yaml.html#hints_compression],
or something like it, to open-source Cassandra.
That implementation uses `DynamicEndpointSnitch.sortedByProximity()`. To
respond to your points under +Why not go with `sortedByProximity()`?+:
# It's true that the implementation differs across every
`AbstractEndpointSnitch` descendant, so the DataStax Enterprise solution only
works when the dynamic snitch is enabled. It checks that the snitch is of type
`DynamicEndpointSnitch` and falls back to the current random implementation if
it is not. So it avoids errors, but it doesn't help in every scenario.
# It looks like the appropriate way to convert the `List<InetAddressAndPort>`
to a `ReplicaCollection` is to call
`[SystemReplicas.getSystemReplicas|https://github.com/apache/cassandra/blob/8b429c8ef9d9907dc3a435ffe7371ec69a9a85e5/src/java/org/apache/cassandra/locator/SystemReplicas.java#L50]`,
which specifically lists the batchlog as a valid use case. A rough sketch
combining both points follows this list.
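Here is that sketch: a minimal illustration of the approach, assuming the
Cassandra 4.x locator APIs linked above. The class and method names are mine
(not DSE's), so treat it as a starting point rather than the actual
implementation.
{code:java}
// Rough sketch only (class/method names are mine, not DSE's); assumes the
// Cassandra 4.x locator APIs referenced in the points above.
import java.util.ArrayList;
import java.util.List;

import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.locator.DynamicEndpointSnitch;
import org.apache.cassandra.locator.EndpointsForRange;
import org.apache.cassandra.locator.IEndpointSnitch;
import org.apache.cassandra.locator.InetAddressAndPort;
import org.apache.cassandra.locator.Replica;
import org.apache.cassandra.locator.SystemReplicas;
import org.apache.cassandra.utils.FBUtilities;

public final class BatchlogEndpointSorter
{
    /**
     * Orders batchlog candidates by dynamic-snitch proximity when possible;
     * otherwise returns them unchanged so the caller can keep today's
     * shuffle-based selection (mirroring the DSE fallback in point 1).
     */
    public static List<InetAddressAndPort> sortByProximityIfPossible(List<InetAddressAndPort> candidates)
    {
        IEndpointSnitch snitch = DatabaseDescriptor.getEndpointSnitch();
        if (!(snitch instanceof DynamicEndpointSnitch))
            return candidates; // no dynamic scores available, keep current behaviour

        // Point 2: sortedByProximity() wants a ReplicaCollection, so convert
        // the raw addresses with SystemReplicas.getSystemReplicas().
        EndpointsForRange replicas = SystemReplicas.getSystemReplicas(candidates);
        EndpointsForRange sorted = ((DynamicEndpointSnitch) snitch)
                .sortedByProximity(FBUtilities.getBroadcastAddressAndPort(), replicas);

        List<InetAddressAndPort> result = new ArrayList<>(sorted.size());
        for (Replica replica : sorted)
            result.add(replica.endpoint());
        return result;
    }
}
{code}
If the dynamic snitch isn't enabled, nothing changes for that cluster: the
existing random selection stays untouched.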
For "{+}Why not go with Dynamic Snitch?"{+} [~mck] and [~maximc] provide strong
arguments against Dynamic snitch. That it is often not enabled, and that
requiring users to take an action to change their snitch to get this fixed
isn't ideal, and could impact performance elsewhere. IFailureDetector's PHI
might be a good alternative, but I'm not entirely sure it works how you expect:
{quote}A couple of assumptions here:
...
The lower PHI value is, the faster responses from a node arrive. It looks this
way from CASSANDRA-2597 explanation regarding ArrivalWindow and the [official
PHI
paper|https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf].
{quote}
I had a different understanding after reading CASSANDRA-2597. It sounds to me
like the failure detector and the dynamic snitch use different calculations,
and *the dynamic snitch does sort by lowest latency, but the failure detector
does not*:
{quote}the developer found that the {{score()}} values were going _down_ for
nodes with higher average latency instead of up ... the phi accrual failure
detector assigns higher badness values to nodes with a low recent latency than
to nodes with high recent latency
{quote}
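The reason I read it that way: as far as I can tell, the failure detector's
phi only looks at gossip heartbeat arrival times, not request latency. Below
is a minimal sketch of my reading of the simplified calculation in
ArrivalWindow/FailureDetector; the constant and formula are my understanding
of trunk, not a copy of the code.
{code:java}
// Sketch of my reading of o.a.c.gms.FailureDetector/ArrivalWindow, not the
// actual implementation: phi grows with the time since the last *gossip
// heartbeat*, scaled by the mean heartbeat interval, so a node that gossips
// on time but serves requests slowly still keeps a low phi.
public final class PhiSketch
{
    private static final double PHI_FACTOR = 1.0 / Math.log(10.0);

    static double phi(long nowNanos, long lastHeartbeatNanos, double meanIntervalNanos)
    {
        return PHI_FACTOR * (nowNanos - lastHeartbeatNanos) / meanIntervalNanos;
    }

    public static void main(String[] args)
    {
        // Heartbeats arriving every ~1s, the last one 0.5s ago: phi stays low
        // (~0.22) no matter how slow the node's reads and writes are.
        System.out.println(phi(1_500_000_000L, 1_000_000_000L, 1_000_000_000.0));
    }
}
{code}
If that reading is right, a replica that gossips on schedule but serves
mutations slowly would keep a low phi, so phi alone wouldn't flag the slow
node that the batchlog selection needs to avoid.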
Have you been able to test and confirm that the failure detector PHI is
behaving as you expect?
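For what it's worth, one way to check would be to throttle a node the way the
issue description describes and watch the per-endpoint phi values the failure
detector reports over JMX. Rough, untested sketch; the MBean object name and
the "PhiValues" attribute are my assumptions about what the FailureDetector
MBean exposes, so please verify against your build.
{code:java}
// Rough, untested sketch: dump per-endpoint phi over JMX. The object name and
// the "PhiValues" attribute are assumptions about the FailureDetector MBean.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.TabularData;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public final class PhiDump
{
    public static void main(String[] args) throws Exception
    {
        // Default Cassandra JMX port on a local node.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName name = new ObjectName("org.apache.cassandra.net:type=FailureDetector");
            TabularData phis = (TabularData) mbs.getAttribute(name, "PhiValues");
            // Each row pairs an endpoint with its current phi; print the raw
            // rows rather than assuming the composite key names.
            for (Object row : phis.values())
                System.out.println(row);
        }
        finally
        {
            connector.close();
        }
    }
}
{code}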
> Single slow node dramatically reduces cluster write throughput regardless of
> CL
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-18120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18120
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Dan Sarisky
> Assignee: Maxim Chanturiay
> Priority: Normal
>
> We issue writes to Cassandra as logged batches (RF=3, consistency levels
> TWO, QUORUM, or LOCAL_QUORUM).
>
> On clusters of any size - a single extremely slow node causes a ~90% loss of
> cluster-wide throughput using batched writes. We can replicate this in the
> lab via CPU or disk throttling. I observe this in 3.11, 4.0, and 4.1.
>
> It appears the mechanism in play is:
> Those logged batches are immediately written to two replica nodes and the
> actual mutations aren't processed until those two nodes acknowledge the batch
> statements. Those replica nodes are selected randomly from all nodes in the
> local data center currently up in gossip. If a single node is slow, but
> still thought to be up in gossip, this eventually causes every other node to
> have all of its MutationStage threads waiting while the slow replica accepts
> batch writes.
>
> The code in play appears to be:
> See
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245].
> In the method filterBatchlogEndpoints() there is a
> Collections.shuffle() to order the endpoints and a
> FailureDetector.isEndpointAlive() to test if the endpoint is acceptable.
>
> This behavior causes Cassandra to move from a multi-node fault-tolerant
> system to a collection of single points of failure.
>
> We try to take administrator actions to kill off the extremely slow nodes,
> but it would be great to have some notion of "what node is a bad choice" when
> writing log batches to replica nodes.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)