Dmitry Konstantinov created CASSANDRA-20743:
-----------------------------------------------
             Summary: Inflation for speculative retry 99% threshold if one replica is slow
                 Key: CASSANDRA-20743
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20743
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Consistency/Coordination
            Reporter: Dmitry Konstantinov
            Assignee: Dmitry Konstantinov

I have executed a set of LOCAL_QUORUM read tests against a 3-node Cassandra cluster (4.1.x) in which slow disk I/O is emulated on one of the nodes: a configured delay is added to SSTable disk-level reads with a configured probability. The purpose of these tests is to ensure that Cassandra latency does not degrade significantly when a single replica is unhealthy.

During these tests I observe an interesting behaviour: drift/inflation of the speculative retry threshold value.

We have a coordinator node, which is a replica as well. Let's assume an injected read delay of 100ms with 2% probability on this node, while the 2 other nodes are healthy. A usual read is executed against the local node plus one of the remote nodes. Because of the introduced delay, 2% of requests cross the speculative retry threshold and trigger a speculative retry to the second remote replica.

By default, the speculative retry threshold is calculated as the 99th percentile of the coordinator latency. In these 2% of cases the coordinator latency is actually equal to the time spent waiting until the speculative retry plus the time to execute the request on a remote replica. We contribute this value back to the coordinator latency metric and so create a degradation feedback loop: while the 2% delay for local disk reads is in place, the speculative retry threshold keeps growing in steps equal to the time to execute the request on a remote replica, degrading more and more.

A possible workaround is to use the MIN(99p,Xms) speculative retry option introduced in CASSANDRA-14293, but it is environment-specific and may depend on the workload, so it may not be easy to define the right value for X.
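The feedback loop described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Cassandra code: the function names, the fixed latencies (1ms local, 2ms remote), the deterministic "every 50th read is slow" pattern standing in for the 2% probability, and the fixed-size sample window are all hypothetical simplifications (Cassandra itself uses a decaying latency histogram).

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def simulate(rounds=5, window=1000, slow_every=50, delay_ms=100.0,
             local_ms=1.0, remote_ms=2.0, initial_threshold_ms=5.0):
    """Return the speculative retry threshold after each recalculation.

    Hypothetical model: the coordinator is also a replica, a quorum needs
    2 responses, and the threshold is the p99 of *coordinator* latency.
    """
    threshold = initial_threshold_ms
    history = [threshold]
    for _ in range(rounds):
        coordinator_latencies = []
        for i in range(window):
            # Every `slow_every`-th local read gets the injected delay (2%).
            local = local_ms + (delay_ms if i % slow_every == 0 else 0.0)
            responses = [local, remote_ms]
            if max(responses) > threshold:
                # Quorum not reached by the threshold: speculate to the
                # second remote replica, which answers at threshold + remote.
                responses.append(threshold + remote_ms)
            # The read completes when the 2nd response arrives.
            coordinator_latencies.append(sorted(responses)[1])
        # The inflated samples (threshold + remote) feed the next threshold.
        threshold = p99(coordinator_latencies)
        history.append(threshold)
    return history


print(simulate(rounds=5))  # [5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
```

In this model the threshold grows by exactly the remote read time (2ms) at every recalculation and only stops once it reaches the slow replica's own latency (101ms here), which matches the step-wise inflation described in the report.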
I have found the same issue reported for ScyllaDB: https://github.com/scylladb/scylladb/pull/8783 . To address it, they started to use replica read response times instead of the full coordinator read time when evaluating the speculative retry threshold.
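To show why feeding replica response times into the threshold avoids the loop, here is a minimal sketch using a toy model. It is an assumption-laden illustration, not ScyllaDB or Cassandra code: the function names, the fixed latencies, and the deterministic "every 50th local read is slow" pattern are hypothetical. The key difference is that no recorded sample includes the time spent waiting for the threshold.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def simulate_replica_based(rounds=5, window=1000, slow_every=50,
                           delay_ms=100.0, local_ms=1.0, remote_ms=2.0,
                           initial_threshold_ms=5.0):
    """Return the threshold after each recalculation (hypothetical model).

    The threshold is the p99 of individual *replica* response times,
    so a sample never depends on the current threshold.
    """
    threshold = initial_threshold_ms
    history = [threshold]
    for _ in range(rounds):
        replica_samples = []
        for i in range(window):
            # Every `slow_every`-th local read gets the injected delay (2%).
            local = local_ms + (delay_ms if i % slow_every == 0 else 0.0)
            replica_samples += [local, remote_ms]
            if max(local, remote_ms) > threshold:
                # Speculative read: record only the replica's own response
                # time, not threshold + response time.
                replica_samples.append(remote_ms)
        threshold = p99(replica_samples)
        history.append(threshold)
    return history


print(simulate_replica_based(rounds=5))  # [5.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

In this model the threshold settles immediately and stays put, because the samples reflect only what each replica did and never the wait imposed by the previous threshold.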