[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540051#comment-14540051 ]
Benedict commented on CASSANDRA-9318:
-------------------------------------
bq. The problem is that we need to give the clients better feedback so they
know to modify their behavior.
I should make it clear I'm not at all opposed to the idea of back pressure. I
have spoken in favour of it many times. However, this design as proposed does
not seem safe to me. (I'm inferring the design, since I don't think there is a
formal proposal; one would still be helpful, to make sure we are discussing the
same thing.)
Fundamentally, I don't see how you can safely distinguish between a "slow" node
that is under load but will catch up shortly and a dead node, at least
without an active "congestion control" algorithm as Ariel described it.
Stopping accepting queries for dead nodes is a catastrophic loss of "A"
(availability). If you have an elegant solution to this that can be implemented
in this coordinator-level rate limiting, my only real showstopping concern is
alleviated, but I don't currently see one. It seems we absolutely have to have a positive
signal from the processing node to slow down, and if we lose that signal we
should continue accepting work (but potentially hint), and that is essentially
the congestion control, and probably really for 2.1. Depending on gossip is not
sufficient (i.e. only implementing this algorithm while nodes are UP) since
there will be an indeterminate period of crossover during which we lose our "A".
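To make the "positive signal" idea concrete, here is a minimal sketch in Java.
It is not Cassandra code and not part of any proposal on this ticket: the class
ReplicaBackPressure, its methods, and the signal time-to-live parameter are all
hypothetical names chosen for illustration. The coordinator throttles a replica
only while it holds a recent explicit "slow down" signal from that replica; once
the signal goes stale (for example because the node died), it fails open and
keeps accepting work, which the caller could then hint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: throttle only on a fresh, explicit signal from the replica. */
final class ReplicaBackPressure {
    // How long a single "slow down" signal stays valid (an assumed tuning knob).
    private final long signalTtlNanos;
    // Replica id -> time (in nanos) at which its last signal expires.
    private final Map<String, Long> throttleUntil = new ConcurrentHashMap<>();

    ReplicaBackPressure(long signalTtlNanos) {
        this.signalTtlNanos = signalTtlNanos;
    }

    /** Record that the replica explicitly asked the coordinator to slow down. */
    void onSlowDownSignal(String replica, long nowNanos) {
        throttleUntil.put(replica, nowNanos + signalTtlNanos);
    }

    /**
     * True only while a fresh signal exists. A replica that has gone silent
     * (slow or dead, we cannot tell which) is never throttled, so the
     * coordinator keeps accepting work for it.
     */
    boolean shouldThrottle(String replica, long nowNanos) {
        Long until = throttleUntil.get(replica);
        // Subtraction form stays correct if the nanosecond clock wraps.
        return until != null && nowNanos - until < 0;
    }
}
```

The point of the sketch is the default: absence of a signal means "accept",
so losing contact with a node never blocks clients.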
bq. we can keep the coordinator from falling over which is what turns a
single-node hiccup into a cluster-wide problem.
We seem to be conflating two goals here: stopping the cluster falling over, and
stopping clients from spamming it. I'm pretty sure we can do the former in 2.1
safely with improved shedding. The latter seems much more difficult than it is
being given credit for, and since the solution being proposed clearly affects
the semantics of our headline feature I'm unconvinced it is mid-release
material.
> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Fix For: 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster
> by bounding the number of in-flight requests and request bytes.
> An implementation might track the number of outstanding bytes and requests
> and, if these reach a high watermark, disable reads on client connections
> until they drop back below some low watermark.
> Need to make sure that disabling read on the client connection won't
> introduce other issues.
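The watermark scheme in the description could be sketched roughly as follows.
This is an illustration only, not the implementation attached to the ticket:
the class InFlightLimiter, its method names, and the choice to track bytes only
(not request counts) are assumptions made for brevity. The caller is expected
to act on the returned flag, for example by toggling auto-read on the client
channel.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of high/low watermark tracking for in-flight request bytes. */
final class InFlightLimiter {
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();
    private final AtomicBoolean readsPaused = new AtomicBoolean(false);

    InFlightLimiter(long highWatermarkBytes, long lowWatermarkBytes) {
        if (lowWatermarkBytes > highWatermarkBytes)
            throw new IllegalArgumentException("low watermark must not exceed high watermark");
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    /** Account for an accepted request; returns true if client reads should be paused. */
    boolean onRequestStart(long bytes) {
        long now = inFlightBytes.addAndGet(bytes);
        if (now >= highWatermarkBytes)
            readsPaused.set(true);
        return readsPaused.get();
    }

    /** Account for a finished request; returns true if client reads should stay paused. */
    boolean onRequestComplete(long bytes) {
        long now = inFlightBytes.addAndGet(-bytes);
        // Resume only once we are back at or below the low watermark (hysteresis).
        if (now <= lowWatermarkBytes)
            readsPaused.set(false);
        return readsPaused.get();
    }

    long inFlightBytes() {
        return inFlightBytes.get();
    }
}
```

The gap between the two watermarks avoids flapping between paused and resumed
on every request. The flag updates here are not atomic with the counter updates,
so a real implementation would need to handle concurrent start/complete races
more carefully.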