[jira] [Updated] (CASSANDRA-18120) Single slow node dramatically reduces cluster logged batch write throughput regardless of CL

2024-05-23 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-18120:
---
    Change Category: Operability
         Complexity: Normal
        Component/s: Consistency/Coordination
      Fix Version/s: 4.0.x
                     4.1.x
                     5.0.x
                     5.x
          Reviewers: Michael Semb Wever
             Status: Open  (was: Triage Needed)

> Single slow node dramatically reduces cluster logged batch write throughput 
> regardless of CL
> 
>
> Key: CASSANDRA-18120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18120
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Coordination
>Reporter: Dan Sarisky
>Assignee: Maxim Chanturiay
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> We issue writes to Cassandra as logged batches (RF=3; consistency levels TWO, 
> QUORUM, or LOCAL_QUORUM).
>  
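> For illustration, a minimal sketch of how such a batch might be issued from a 
> client, assuming the DataStax Java driver 4.x and a hypothetical table 
> ks.events (not our actual schema):
>
>   import com.datastax.oss.driver.api.core.CqlSession;
>   import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
>   import com.datastax.oss.driver.api.core.cql.BatchStatement;
>   import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
>   import com.datastax.oss.driver.api.core.cql.SimpleStatement;
>
>   public class LoggedBatchExample
>   {
>       public static void main(String[] args)
>       {
>           try (CqlSession session = CqlSession.builder().build())
>           {
>               // Two inserts grouped into one logged (atomic) batch, written at QUORUM.
>               // Statements are immutable, so setConsistencyLevel returns a new copy.
>               BatchStatement batch = BatchStatement.builder(DefaultBatchType.LOGGED)
>                   .addStatement(SimpleStatement.newInstance(
>                       "INSERT INTO ks.events (id, payload) VALUES (?, ?)", 1, "a"))
>                   .addStatement(SimpleStatement.newInstance(
>                       "INSERT INTO ks.events (id, payload) VALUES (?, ?)", 2, "b"))
>                   .build()
>                   .setConsistencyLevel(DefaultConsistencyLevel.QUORUM);
>               session.execute(batch);
>           }
>       }
>   }
>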
> On clusters of any size, a single extremely slow node causes a ~90% loss of 
> cluster-wide throughput for batched writes.  We can reproduce this in the lab 
> via CPU or disk throttling, and we observe it in 3.11, 4.0, and 4.1.
>  
> The mechanism appears to be this:
> Each logged batch is first written to two batchlog replica nodes, and the 
> actual mutations are not processed until those two nodes acknowledge the 
> batchlog write.  Those replica nodes are selected randomly from all nodes in 
> the local data center currently up in gossip.  If a single node is slow, but 
> still considered up in gossip, every other node eventually ends up with all 
> of its MutationStage threads waiting for the slow replica to accept batchlog 
> writes.
>  
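> As a rough, self-contained model of that flow (not actual Cassandra code; 
> writeBatchlogTo and applyMutations below are stand-ins for the real write path):
>
>   import java.util.List;
>   import java.util.concurrent.CompletableFuture;
>   import java.util.function.Function;
>
>   // The batch is written to two batchlog replicas, and the real mutations are
>   // held back until BOTH acks arrive, so one slow replica stalls the whole
>   // batch (and the thread driving it, the MutationStage analogue here).
>   final class LoggedBatchFlowSketch
>   {
>       static void writeLoggedBatch(List<String> batchlogReplicas,
>                                    Function<String, CompletableFuture<Void>> writeBatchlogTo,
>                                    Runnable applyMutations)
>       {
>           // Assumes at least two batchlog replicas were selected.
>           CompletableFuture<Void> ack1 = writeBatchlogTo.apply(batchlogReplicas.get(0));
>           CompletableFuture<Void> ack2 = writeBatchlogTo.apply(batchlogReplicas.get(1));
>           CompletableFuture.allOf(ack1, ack2).join(); // blocks until both replicas acknowledge
>           applyMutations.run();                       // the actual writes proceed only now
>       }
>   }
>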
> The code in play appears to be
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245].
> In the method filterBatchlogEndpoints() there is a Collections.shuffle() to 
> order the endpoints and a FailureDetector.isEndpointAlive() check to decide 
> whether an endpoint is acceptable.
>  
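> As a simplified sketch of that selection logic (not the actual ReplicaPlans 
> code; the isAlive predicate stands in for the FailureDetector check):
>
>   import java.util.ArrayList;
>   import java.util.Collections;
>   import java.util.List;
>   import java.util.function.Predicate;
>
>   // Shuffle all local-DC endpoints, then keep the first two that the failure
>   // detector still considers alive. A slow-but-alive node passes this filter
>   // and can land on the critical path of many batches.
>   final class BatchlogEndpointSketch
>   {
>       static List<String> pickBatchlogEndpoints(List<String> localDcEndpoints,
>                                                 Predicate<String> isAlive)
>       {
>           List<String> candidates = new ArrayList<>(localDcEndpoints);
>           Collections.shuffle(candidates);   // random order, no latency awareness
>           List<String> chosen = new ArrayList<>(2);
>           for (String endpoint : candidates)
>           {
>               if (isAlive.test(endpoint))    // gossip liveness is the only criterion
>               {
>                   chosen.add(endpoint);
>                   if (chosen.size() == 2)
>                       break;
>               }
>           }
>           return chosen;
>       }
>   }
>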
> This behavior turns Cassandra from a multi-node, fault-tolerant system into a 
> collection of single points of failure.
>  
> We try to take administrative action to kill off extremely slow nodes, but it 
> would be great to have some notion of "which node is a bad choice" when 
> selecting the replica nodes that receive logged batch writes.
>  
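> For illustration only, one possible shape of that "bad choice" filter: rank 
> the live candidates by some measured per-endpoint latency (latencyScore below 
> is a stand-in, e.g. something like the dynamic snitch's per-endpoint score) 
> and prefer the fastest two. This is a sketch of the idea, not a proposed patch.
>
>   import java.util.ArrayList;
>   import java.util.Comparator;
>   import java.util.List;
>   import java.util.function.Predicate;
>   import java.util.function.ToDoubleFunction;
>
>   // Keep only live endpoints, then sort by latency score (lower = faster) so a
>   // single slow-but-alive node is rarely chosen as a batchlog replica.
>   final class LatencyAwareBatchlogSketch
>   {
>       static List<String> pickBatchlogEndpoints(List<String> localDcEndpoints,
>                                                 Predicate<String> isAlive,
>                                                 ToDoubleFunction<String> latencyScore)
>       {
>           List<String> live = new ArrayList<>();
>           for (String endpoint : localDcEndpoints)
>               if (isAlive.test(endpoint))
>                   live.add(endpoint);
>           live.sort(Comparator.comparingDouble(latencyScore));
>           return new ArrayList<>(live.subList(0, Math.min(2, live.size())));
>       }
>   }
>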



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18120) Single slow node dramatically reduces cluster logged batch write throughput regardless of CL

2024-05-23 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-18120:
---
Summary: Single slow node dramatically reduces cluster logged batch write 
throughput regardless of CL  (was: Single slow node dramatically reduces 
cluster write throughput regardless of CL)



