[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535369#comment-14535369 ]
Benedict edited comment on CASSANDRA-9318 at 5/8/15 7:52 PM:
-------------------------------------------------------------

bq. because our existing load shedding is fine at recovering from temporary spikes in load

Are you certain? The recent testing Ariel did on CASSANDRA-8670 demonstrated that it was the MUTATION stage bringing the cluster down, not the ExpiringMap; and this was in a small cluster. If anything, I suspect our ability to prune these messages is also theoretically worse, on top of this practical datapoint, because it is done on dequeue, whereas pruning in the ExpiringMap and MessagingService (whilst they have a slightly longer expiry) is done asynchronously (or on enqueue) and cannot be blocked by e.g. flush.

What I'm effectively suggesting is simply making all of the load shedding happen on enqueue, and basing it on queue length as well as time, so that our load shedding really is more robust.

The coordinator is also on the "right side" of the equation: as the cluster grows, any single node's problems should spread out to the coordinators more slowly, whereas the coordinator's ability to flood a processing node scales up at the same (well, inverted) rate.

was (Author: benedict):

bq. because our existing load shedding is fine at recovering from temporary spikes in load

Are you certain? The recent testing Ariel did on CASSANDRA-8670 demonstrated the MUTATION stage was what was bringing the cluster down, not the ExpiringMap; and this was in a small cluster. If anything, I suspect our ability to prune these messages is also theoretically worse, on top of this practical datapoint, because it is done on dequeue, whereas the ExpiringMap (whilst having a slightly longer expiry) is done asynchronously and cannot be blocked by e.g. flush.
The coordinator is also on the "right side" of the equation: as the cluster grows, any single problems should spread out to the coordinators more slowly, whereas the coordinator's ability to flood a processing node scales up at the same (well, inverted) rate.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
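The enqueue-time shedding Benedict suggests in the comment above — dropping messages when the queue is already too long, and expiring stale messages by age rather than pruning them only at dequeue — could be sketched roughly as follows. This is a hypothetical illustration, not Cassandra's actual MessagingService or MUTATION-stage code; the SheddingQueue class and all its names are invented:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: shed load on enqueue based on queue length,
// and drop messages that exceeded their allowed age before processing.
final class SheddingQueue<T> {
    private static final class Entry<T> {
        final T payload;
        final long enqueuedAtMillis;
        Entry(T payload, long enqueuedAtMillis) {
            this.payload = payload;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    private final ConcurrentLinkedQueue<Entry<T>> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger();
    private final int maxLength;
    private final long maxAgeMillis;

    SheddingQueue(int maxLength, long maxAgeMillis) {
        this.maxLength = maxLength;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Sheds on enqueue: returns false if the queue is already at capacity. */
    boolean offer(T payload, long nowMillis) {
        if (size.get() >= maxLength)
            return false;                     // shed immediately, cheap for the producer
        queue.add(new Entry<>(payload, nowMillis));
        size.incrementAndGet();
        return true;
    }

    /** Returns the next message that has not expired, silently dropping stale ones. */
    T poll(long nowMillis) {
        Entry<T> e;
        while ((e = queue.poll()) != null) {
            size.decrementAndGet();
            if (nowMillis - e.enqueuedAtMillis <= maxAgeMillis)
                return e.payload;             // still fresh: process it
            // otherwise it expired while waiting; drop and keep looking
        }
        return null;
    }
}
```

The key point of the argument is that the length check happens in offer(), on the producer's side, so a blocked consumer (e.g. one stalled behind flush) cannot prevent load from being shed.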
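The high/low watermark scheme from the ticket description — track outstanding request bytes, stop reading from client connections once a high watermark is crossed, and resume only after dropping below a low watermark — might look something like this sketch. InFlightLimiter and its method names are invented for illustration; a real implementation would toggle reads on the actual client channels where the comments indicate:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of bounding in-flight request bytes with
// high/low watermarks, as described in the ticket. Not Cassandra's
// actual implementation.
final class InFlightLimiter {
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();
    private volatile boolean readsPaused = false;

    InFlightLimiter(long highWatermarkBytes, long lowWatermarkBytes) {
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    boolean readsPaused() { return readsPaused; }

    /** Called when a request is accepted from a client connection. */
    void onRequestStart(long bytes) {
        if (inFlightBytes.addAndGet(bytes) >= highWatermarkBytes)
            readsPaused = true;   // caller would disable read on client channels here
    }

    /** Called when the corresponding response has been flushed back. */
    void onRequestComplete(long bytes) {
        if (inFlightBytes.addAndGet(-bytes) <= lowWatermarkBytes)
            readsPaused = false;  // caller would re-enable read here
    }
}
```

Using two watermarks instead of one threshold gives hysteresis, so reads are not rapidly toggled on and off while the byte count hovers around a single limit; the caveat in the ticket (that disabling read might introduce other issues) still applies.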