[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540051#comment-14540051
 ] 

Benedict commented on CASSANDRA-9318:
-------------------------------------

bq. The problem is that we need to give the clients better feedback so they 
know to modify their behavior.

I should make it clear that I'm not at all opposed to the idea of back pressure; 
I have spoken in favour of it many times. However, this design as proposed (or 
as I'm inferring it, since I don't think there is a formal proposal; one would 
still be helpful, to make sure we are discussing the same thing) does not seem 
safe to me.

Fundamentally, I don't see how you can safely distinguish between a "slow" node 
that is under load and will catch up shortly, and a dead node, at least 
without an active "congestion control" algorithm as Ariel described it. 
Refusing to accept queries because a node is dead is a catastrophic loss of "A" 
(availability). If you have an elegant solution to this that can be implemented 
in this coordinator-level rate limiting, the only real showstopping concern I 
have is alleviated, but I don't currently see one. It seems we absolutely have to have a positive 
signal from the processing node to slow down, and if we lose that signal we 
should continue accepting work (but potentially hint), and that is essentially 
the congestion control, and probably really for 2.1. Depending on gossip is not 
sufficient (i.e. only implementing this algorithm while nodes are UP) since 
there will be an indeterminate period of crossover during which we lose our "A".
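
To make the rule above concrete, here is a minimal sketch of the decision it implies: throttle only on an explicit, recent "slow down" signal from the replica, and fall back to accepting (and possibly hinting) once that signal goes stale. The class name, the `Action` values and the timeout parameter are all hypothetical and are not taken from the Cassandra codebase.

```java
/** Hypothetical sketch of the "positive signal" rule; not Cassandra code. */
final class CongestionPolicy {
    enum Action { ACCEPT, THROTTLE, ACCEPT_AND_HINT }

    /**
     * @param slowDownRequested     last signal received from the replica asked us to slow down
     * @param millisSinceLastSignal age of that signal
     * @param signalTimeoutMillis   assumed cutoff after which the signal counts as lost
     */
    static Action decide(boolean slowDownRequested,
                         long millisSinceLastSignal,
                         long signalTimeoutMillis) {
        // Signal lost: the node may be dead rather than slow, so keep
        // accepting work (potentially hinting) to preserve availability.
        if (millisSinceLastSignal > signalTimeoutMillis)
            return Action.ACCEPT_AND_HINT;
        // Fresh signal: throttle only if the replica explicitly asked for it.
        return slowDownRequested ? Action.THROTTLE : Action.ACCEPT;
    }
}
```

The point of the sketch is that the absence of a signal never leads to throttling; only a fresh, explicit request does.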

bq.  we can keep the coordinator from falling over which is what turns a 
single-node hiccup into a cluster-wide problem.

We seem to be conflating two goals here: stopping the cluster from falling over, 
and stopping clients from spamming it. I'm pretty sure we can do the former in 
2.1 safely with improved shedding. The latter seems much more difficult than it 
is being given credit for, and since the solution being proposed clearly 
affects the semantics of our headline feature, I'm unconvinced it is 
mid-release material.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
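
The watermark scheme in the issue description could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the class name, method names and thresholds are hypothetical, and the code only tracks state. Actually pausing reads is left to the caller (with Netty, for example, that might be `channel.config().setAutoRead(false)`).

```java
/** Hypothetical high/low watermark tracker; not Cassandra code. */
final class InFlightLimiter {
    private final long highBytes, lowBytes;
    private final int highRequests, lowRequests;
    private long bytes;      // outstanding request bytes
    private int requests;    // outstanding request count
    private boolean paused;  // true while client reads should be disabled

    InFlightLimiter(long highBytes, long lowBytes, int highRequests, int lowRequests) {
        this.highBytes = highBytes;
        this.lowBytes = lowBytes;
        this.highRequests = highRequests;
        this.lowRequests = lowRequests;
    }

    /** Record a new in-flight request; returns whether reads stay enabled. */
    synchronized boolean onRequestStart(long size) {
        bytes += size;
        requests++;
        // Crossing either high watermark disables reads.
        if (bytes >= highBytes || requests >= highRequests)
            paused = true;
        return !paused;
    }

    /** Record a completed request; returns whether reads are enabled. */
    synchronized boolean onRequestComplete(long size) {
        bytes -= size;
        requests--;
        // Re-enable only once both counters are back under the low watermarks.
        if (paused && bytes <= lowBytes && requests <= lowRequests)
            paused = false;
        return !paused;
    }

    synchronized boolean isReadEnabled() {
        return !paused;
    }
}
```

Keeping the low watermark well below the high one gives hysteresis, so reads are not toggled on and off around a single threshold.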



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
