[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535369#comment-14535369 ]

Benedict edited comment on CASSANDRA-9318 at 5/8/15 7:52 PM:
-------------------------------------------------------------

bq. because our existing load shedding is fine at recovering from temporary 
spikes in load

Are you certain? The recent testing Ariel did on CASSANDRA-8670 demonstrated 
that it was the MUTATION stage, not the ExpiringMap, that was bringing the 
cluster down; and this was in a small cluster.

If anything, on top of this practical datapoint, I suspect our ability to prune 
these messages is also theoretically worse, because it is done on dequeue, 
whereas pruning in the ExpiringMap and MessagingService (whilst having a 
slightly longer expiry) is done asynchronously (or on enqueue) and cannot be 
blocked by e.g. flush. What I'm effectively suggesting is making all of the 
load shedding happen on enqueue, and basing it on queue length as well as time, 
so that it is simply more robust.

The coordinator is also on the "right side" of the equation: as the cluster 
grows, problems at any single node should spread out to the coordinators more 
slowly, whereas the coordinators' collective ability to flood a processing node 
scales up at the same (well, inverted) rate.



> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and, if it reaches a high watermark, disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
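
A rough sketch of that high/low watermark approach (class names and thresholds 
are illustrative assumptions, not an actual Cassandra implementation), using 
Netty's autoRead toggle to stop reading from client connections:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

import io.netty.channel.Channel;

// Track in-flight request bytes globally; stop reading from a client channel
// while above the high watermark, and resume once we drain back below the low
// watermark, letting TCP backpressure hold off further load.
final class InflightLimiter
{
    private static final long HIGH_WATERMARK = 256L * 1024 * 1024; // pause reads above this
    private static final long LOW_WATERMARK  = 192L * 1024 * 1024; // resume reads below this
    private static final AtomicLong inflightBytes = new AtomicLong();

    static void onRequestStart(Channel channel, int requestBytes)
    {
        if (inflightBytes.addAndGet(requestBytes) > HIGH_WATERMARK)
            channel.config().setAutoRead(false); // stop accepting new requests from this client
    }

    static void onRequestComplete(Channel channel, int requestBytes)
    {
        if (inflightBytes.addAndGet(-requestBytes) < LOW_WATERMARK)
            channel.config().setAutoRead(true); // safe to accept more work
    }
}
{code}

A real version would need to resume every paused connection once below the low 
watermark (not just the one whose request completed), and, as the description 
notes, disabling read needs checking for knock-on effects on the client 
connection.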


