[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376629#comment-15376629
 ] 

Sergio Bossa commented on CASSANDRA-9318:
-----------------------------------------

[~jbellis],

yes, I think we're going in circles, and probably one of us is missing the 
other's point.

You said:

bq. Pick a number for how much memory we can afford to have taken up by in 
flight requests (remembering that we need to keep the entire payload around 
for potential hint writing) as a fraction of the heap, the way we do with 
memtables or key cache. If we hit that mark we start throttling new requests 
and only accept them as old ones drain off.

I'll make one last attempt at explaining why _in my opinion_ this is worse _on 
all counts_, via an example.

Say you have:
* U: the memory unit to express back-pressure.
* T: the write RPC timeout.
* A 3-node cluster with RF=2.
* Writes at CL.ONE.

Say the coordinator memory threshold is 10U and clients hit it with requests 
worth 20U; for simplicity, the coordinator is not a replica, which means it 
has to accommodate 40U of in-flight requests (20U per replica). At some point 
P < T, the first replica answers all of them, while the second replica answers 
only half, which leaves 10U of requests in flight, meeting the memory 
threshold; those requests will only drain when either the replica answers 
(let's call this time interval R) or T elapses (a rough sketch follows the 
list below). This means:
1) During a time period equal to min(R, T) you'll have 0 throughput. This is 
made worse if you actually have to wait for T to elapse, as the outage becomes 
proportional to T.
2) If replica 2 keeps exhibiting the same slow behaviour, you'll keep seeing a 
throughput profile of high peaks and 0 valleys; this is bad not just because of 
the 0 valleys, but also because during the high peaks the slow replica will end 
up dropping 10U worth of mutations.
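
To make the arithmetic concrete, here's a back-of-the-envelope sketch of the 
scenario; every name and number in it is hypothetical and purely illustrative 
of the example above, not actual Cassandra code:

{code:java}
// Illustrative only: walks through the 10U/20U/40U arithmetic above.
public class ThresholdThrottlingExample
{
    static final int MEMORY_THRESHOLD = 10; // coordinator threshold, in units of U

    public static void main(String[] args)
    {
        int incoming = 20;               // clients send requests worth 20U
        int inFlight = incoming * 2;     // RF=2, coordinator not a replica: 40U in flight

        inFlight -= incoming;            // at P < T, replica 1 answers all 20U
        inFlight -= incoming / 2;        // replica 2 answers only half: 10U still pending

        // The threshold is met: the coordinator accepts nothing until the
        // remaining 10U drain, i.e. zero throughput for min(R, T).
        boolean throttled = inFlight >= MEMORY_THRESHOLD;
        System.out.printf("in-flight = %dU, throttled = %b%n", inFlight, throttled);
    }
}
{code}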

Both #1 and #2 look like pretty bad outcomes to me, and definitely no better 
than throttling at the speed of the slowest replica.

Speaking of which, how would the other algorithm work in such a case? Given 
the same scenario above, we'd have the following (a rough sketch follows this 
list):
1) During the first T period (T1), there will be no back-pressure (because the 
algorithm works on time windows of size T), so nothing changes from the 
current behaviour.
2) At T2, it will start throttling at the slow replica's rate of 10U 
(obviously the actual throttling is not memory based in this case, but I'll 
keep the same notation for simplicity): this means the sustained throughput 
during T2 will be 10U.
3) For all Tn > T2, throughput will *gradually* increase or decrease with the 
slow replica's own rate, but without ever touching 0, which means _no high 
peaks causing a high dropped mutations rate_ and _no 0 valleys of length T_.
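
For clarity, here's a minimal sketch of the windowed idea; class and method 
names are made up for illustration and this is not the actual patch, just the 
shape of the mechanism: the send rate towards a replica during window Tn+1 is 
capped by the response rate it sustained during window Tn.

{code:java}
// Hypothetical sketch: per-replica windowed rate limiting.
public class RateBasedThrottleSketch
{
    private long outgoing;               // requests sent in the current window
    private long incoming;               // responses received in the current window
    private long limit = Long.MAX_VALUE; // no back-pressure during T1

    // Called at every window boundary of size T (the write RPC timeout).
    public synchronized void rollWindow()
    {
        // The next window is capped at the replica's observed response rate:
        // if the replica catches up, incoming grows and the limit rises
        // gradually, instead of oscillating between full speed and zero.
        limit = incoming;
        outgoing = 0;
        incoming = 0;
    }

    // Coordinator calls this before forwarding a mutation to the replica.
    public synchronized boolean tryAcquire()
    {
        if (outgoing >= limit)
            return false;                // caller backs off instead of queueing until T
        outgoing++;
        return true;
    }

    // Coordinator calls this when the replica acknowledges a mutation.
    public synchronized void onResponse()
    {
        incoming++;
    }
}
{code}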

Hope this clarifies my reasoning, and please let me know if/how my example 
above is flawed (that is, if I keep missing your point).

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.


