[jira] [Commented] (CASSANDRA-15442) Read repair implicitly increases read timeout value

Yifan Cai (Jira) Wed, 11 Dec 2019 16:47:22 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994025#comment-16994025
 ]


Yifan Cai commented on CASSANDRA-15442:
---------------------------------------

Here are some formulas to analyze the impact of cautiously allowing read 
repair mutations a longer timeout. 

For simplicity, model request handling as a simple, stable queue. By Little's 
law, we have,

_M / Lm = Rm_

where _M_ is the average number of mutations in the system during a given time 
window, _Lm_ is the average mutation latency, and _Rm_ is the mutation rate 
(throughput). 

If some of the mutations are allowed a longer timeout, the new average latency is 

_Lm’ = (Lm * M1 + L * M2) / M_

where _M1_ is the number of regular mutations, _M2_ is the number of very slow 
read repair mutations (which would time out without the increased timeout), 
and _L_ is the average latency of those slow mutations. _M1_ and _M2_ 
satisfy,
 * _M1 + M2 = M_
 * _M2 = M * Prr * Pmto_, where _Prr_ is the observed fraction of mutations 
that are read repair mutations, and _Pmto_ is the observed fraction of 
mutations that time out.

The range of _L_ is 

_Lm < L <= R_, where _R_ is the configured read timeout

Therefore, substituting _M1 = M - M2_ and _L <= R_, we have

_Lm' <= Lm + (R - Lm) * Prr * Pmto_

_Rm' >= M / (Lm + (R - Lm) * Prr * Pmto)_

Let _K = (R - Lm) / Lm_, a constant since _R_ and _Lm_ are treated as 
constants. The ratio of the prior rate to the new rate with the change is,

_1 < Rm / Rm' <= 1 + K * Prr * Pmto_ 

Based on this inequality, the fraction of read repair mutations and the 
fraction of mutation timeouts are the factors that determine the throughput 
impact. 

In a healthy cluster, i.e. one with a low mutation timeout rate, the ratio 
should be close to 1, meaning the impact is small. 
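
To make the bound concrete, here is a minimal Python sketch that evaluates 
_1 + K * Prr * Pmto_. The function name and all numbers below are 
illustrative assumptions, not measurements from any cluster.

```python
def throughput_ratio_upper_bound(read_timeout, avg_mutation_latency,
                                 p_read_repair, p_mutation_timeout):
    """Upper bound on Rm / Rm' = 1 + K * Prr * Pmto, with K = (R - Lm) / Lm.

    read_timeout and avg_mutation_latency must use the same time unit.
    """
    if avg_mutation_latency <= 0:
        raise ValueError("average mutation latency must be positive")
    if read_timeout < avg_mutation_latency:
        raise ValueError("the model assumes Lm <= R")
    for p in (p_read_repair, p_mutation_timeout):
        if not 0.0 <= p <= 1.0:
            raise ValueError("fractions must be within [0, 1]")
    k = (read_timeout - avg_mutation_latency) / avg_mutation_latency
    return 1.0 + k * p_read_repair * p_mutation_timeout


# Assumed values: R = 5 s, Lm = 5 ms, so K is roughly 999.
# "Healthy" case: 1% of mutations are read repair, 0.1% time out.
healthy_bound = throughput_ratio_upper_bound(5.0, 0.005, 0.01, 0.001)
# "Degraded" case: 10% of mutations are read repair, 5% time out.
degraded_bound = throughput_ratio_upper_bound(5.0, 0.005, 0.10, 0.05)

print(f"healthy:  Rm / Rm' <= {healthy_bound:.5f}")
print(f"degraded: Rm / Rm' <= {degraded_bound:.5f}")
```

With these assumed inputs the healthy bound is about 1.01 and the degraded 
bound is about 6.0, which matches the reading above: the impact stays small 
only while _Prr * Pmto_ is small relative to _1 / K_.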

> Read repair implicitly increases read timeout value
> ---------------------------------------------------
>
>                 Key: CASSANDRA-15442
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15442
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Yifan Cai
>            Assignee: Yifan Cai
>            Priority: Normal
>
> When read repair occurs during a read, internally, it starts several 
> _blocking_ operations in sequence. See 
> {{org.apache.cassandra.service.StorageProxy#fetchRows}}. 
>  The timeline of the blocking operations
>  # Regular read, wait for full data/digest read response to complete. 
> {{reads[*].awaitResponses();}}
>  # Read repair read, wait for full data read response to complete. 
> {{reads[*].awaitReadRepair();}}
>  # Read repair write, wait for write response to complete. 
> {{concatAndBlockOnRepair(results, repairs);}}
> Steps 1 and 2 share the same timeout, and wait for the duration of the read 
> timeout, say 5 s.
> Step 3 waits for the duration of the write timeout, say 2 s.
> In the worst case, the actual time taken for a read could accumulate to ~7 s, 
> even if no individual step exceeds its timeout value.
> From the client's perspective, a request is not expected to take longer than 
> the timeout value configured in the database. 
> Such a scenario is especially bad for clients that have set up client-side 
> timeout monitoring close to the configured one. The clients think the 
> operations timed out and abort, but they are in fact still running on the 
> server.
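
The timeout arithmetic in the description can be sketched in Python as 
follows. The function names are hypothetical, and the shared-deadline policy 
is only an assumed alternative for comparison, not the behavior of 
{{StorageProxy#fetchRows}} or a confirmed fix. Times are in milliseconds.

```python
def repair_write_wait_ms(elapsed_ms, read_timeout_ms, write_timeout_ms,
                         shared_deadline):
    """How long step 3 (the read repair write) may block.

    elapsed_ms is the time already spent in steps 1 and 2, which share the
    read timeout. With shared_deadline=False, step 3 gets a fresh write
    timeout, as described in the issue. With shared_deadline=True
    (hypothetical policy), step 3 only gets what is left of the read timeout.
    """
    if not 0 <= elapsed_ms <= read_timeout_ms:
        raise ValueError("steps 1 and 2 cannot exceed the read timeout")
    if shared_deadline:
        return max(0, read_timeout_ms - elapsed_ms)
    return write_timeout_ms


def total_read_ms(elapsed_ms, read_timeout_ms, write_timeout_ms,
                  shared_deadline):
    """Longest time the whole read can take for a given elapsed time."""
    return elapsed_ms + repair_write_wait_ms(
        elapsed_ms, read_timeout_ms, write_timeout_ms, shared_deadline)


# Values from the description: read timeout 5 s, write timeout 2 s.
worst_case_ms = total_read_ms(5000, 5000, 2000, shared_deadline=False)
print(worst_case_ms)  # 7000, the ~7 s worst case from the description
print(total_read_ms(4900, 5000, 2000, shared_deadline=True))  # 5000
```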



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
