Dmitry Konstantinov created CASSANDRA-20743:
-----------------------------------------------

             Summary: Inflation for speculative retry 99% threshold if one 
replica is slow
                 Key: CASSANDRA-20743
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20743
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Consistency/Coordination
            Reporter: Dmitry Konstantinov
            Assignee: Dmitry Konstantinov


I ran a set of LOCAL_QUORUM read tests against a 3-node Cassandra cluster 
(4.1.x) in which slow disk IO is emulated on one node by adding a configured 
delay, with a configured probability, to SSTable disk-level reads. The purpose 
of these tests is to ensure that Cassandra read latency does not degrade 
significantly when a single replica is unhealthy.

During these tests I observed an interesting behaviour: the speculative retry 
threshold value drifts upward (inflates).
The coordinator node is also a replica. Assume a read delay of 100ms is 
injected with 2% probability on this node, and the 2 other nodes are healthy. 
A usual read is served by the local node plus one of the remote nodes.
Because of the injected delay, 2% of requests cross the speculative retry 
threshold and trigger a speculative retry to the second remote replica.
By default, the speculative retry threshold is calculated as the 99th 
percentile of +coordinator latency+. In these 2% of cases the coordinator 
latency equals the time spent waiting for the speculative retry to fire plus 
the time to execute the request on a remote replica. This value is recorded 
back into the coordinator latency metric, and because 2% of the samples is more 
than the 1% tail above the 99th percentile, it creates a degradation feedback 
loop: while the 2% delay on local disk reads is in place, the speculative retry 
threshold grows in steps equal to the time to execute the request on a remote 
replica, degrading more and more.
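
The feedback loop can be illustrated with a small deterministic model. This is 
only a sketch under simplifying assumptions, not Cassandra's actual code: the 
function names, the 1ms local and 5ms remote latencies, the rule that exactly 
every 50th read is slow (2%), and recomputing the threshold once per round from 
that round's samples only are all made up for illustration.

```python
def percentile_99(samples):
    """Nearest-rank 99th percentile (integer arithmetic for the rank)."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def simulate(rounds=10, reads_per_round=1000, local_ms=1.0, remote_ms=5.0,
             injected_delay_ms=100.0, slow_every=50, initial_threshold_ms=5.0):
    """Return the speculative retry threshold after each round (hypothetical model)."""
    threshold = initial_threshold_ms
    history = [threshold]
    for _ in range(rounds):
        coordinator_latencies = []
        for i in range(reads_per_round):
            # Every `slow_every`-th local read gets the injected delay (2% by default).
            slow = i % slow_every == 0
            local = local_ms + (injected_delay_ms if slow else 0.0)
            # If the local read is still pending at `threshold`, a speculative
            # read is sent to the second remote replica and completes at
            # threshold + remote_ms.
            speculative_done = threshold + remote_ms
            second_response = min(local, speculative_done)
            # LOCAL_QUORUM needs two responses: the first remote replica
            # plus whichever of local / speculative arrives first.
            coordinator_latencies.append(max(remote_ms, second_response))
        # The threshold is fed by coordinator latency, which already
        # includes the time spent waiting for the previous threshold.
        threshold = percentile_99(coordinator_latencies)
        history.append(threshold)
    return history
```

In this model `simulate(rounds=4)` returns `[5.0, 10.0, 15.0, 20.0, 25.0]`: each 
round adds one remote read time (5ms) to the threshold, and the growth only 
stops once the threshold reaches the delayed local read time itself.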

A possible workaround is to use the MIN(99p,Xms) speculative retry option 
introduced in CASSANDRA-14293, but it is environment-specific and may depend on 
the workload, so it may not be easy to choose the right value for X.

I have found the same issue reported for ScyllaDB: 
https://github.com/scylladb/scylladb/pull/8783 . To address it, they started to 
use replica read response times instead of the full coordinator read time when 
evaluating the speculative retry threshold.
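
Why replica response times avoid the loop can be sketched with the same kind of 
toy model. This is not ScyllaDB's implementation; the function names, the 
latencies (1ms local, 5ms remote, every 50th local read delayed by 100ms) and 
the once-per-round recomputation are assumptions made for illustration. The key 
difference is that no recorded sample contains the time spent waiting for the 
threshold.

```python
def percentile_99(samples):
    """Nearest-rank 99th percentile (integer arithmetic for the rank)."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def simulate_replica_based(rounds=5, reads_per_round=1000, local_ms=1.0,
                           remote_ms=5.0, injected_delay_ms=100.0,
                           slow_every=50, initial_threshold_ms=5.0):
    """Threshold history when it is fed by replica response times (hypothetical)."""
    threshold = initial_threshold_ms
    history = [threshold]
    for _ in range(rounds):
        replica_times = []
        for i in range(reads_per_round):
            slow = i % slow_every == 0
            local = local_ms + (injected_delay_ms if slow else 0.0)
            # Record how long each replica took to answer, measured from
            # the moment the request was sent to that replica.
            replica_times.append(local)
            replica_times.append(remote_ms)
            if local > threshold:
                # A speculative read was sent; only its own response time
                # is recorded, not the wait that preceded it.
                replica_times.append(remote_ms)
        threshold = percentile_99(replica_times)
        history.append(threshold)
    return history
```

With these numbers the threshold stays at 5.0 in every round, and starting from 
an inflated value such as `initial_threshold_ms=200.0` it drops back to 5.0 
after one round. Slow local responses are about 1% of all replica samples here, 
so the 99th percentile sits close to the boundary; with other numbers it could 
land on the slow value instead, but it is bounded by real replica response 
times rather than growing with the previous threshold.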



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
