[
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605345#comment-14605345
]
Benedict commented on CASSANDRA-9318:
-------------------------------------
bq. I think you're \[Benedict\] overselling how scary it is to stop reading new
requests until we can free up some memory from MS.
The problem is that freeing up memory can be constrained by one or a handful of
_dead_ nodes. Not only can we stop accepting work; we can significantly reduce
cluster throughput as a result of a *single* timed-out node. I'm not
overselling anything, although I may have a different risk analysis than you do.
Take a simple mathematical thought experiment: we have a four-node cluster
(pretty common), with RF=3, serving 100k op/s per coordinator; each operation
occupies around 2KB in memory as a Mutation (again, pretty typical). Ordinary
round-trip time is 10ms (also pretty typical).
So, under normal load we would see around 2MB of data maintained for our
in-flight queries across each coordinator. But now one node goes down. With
RF=3 on four nodes, this node is a replica for 3/4 of all writes to the
cluster, so we see 150MB/s of data accumulate in each coordinator. Our limit is
probably no more than 300MB (probably lower), so each coordinator fills up in
about 2s. Our timeout is 10s, so we then have 8s during which nothing can be
done, across the cluster, due to one node's death. After that 8s has elapsed,
we get another flurry. Then another 8s of nothing. Even with a CL of ONE.
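The arithmetic behind those figures (a back-of-envelope check, using the same assumptions as above):

```python
# Back-of-envelope check of the thought experiment (sizes in bytes).
ops_per_sec = 100_000          # per-coordinator write throughput
mutation_bytes = 2 * 1024      # ~2KB per Mutation in memory
rtt = 0.010                    # 10ms ordinary round-trip time
rf, nodes = 3, 4

# Steady state: data held for in-flight queries per coordinator.
steady_state = ops_per_sec * mutation_bytes * rtt              # ~2MB

# One node down: it is a replica for rf/nodes of all writes, and those
# entries cannot be freed until the timeout fires.
accumulation_rate = (rf / nodes) * ops_per_sec * mutation_bytes  # ~150MB/s
limit = 300 * 1024 * 1024      # ~300MB coordinator memory bound
timeout = 10.0                 # write timeout in seconds

time_to_fill = limit / accumulation_rate   # coordinator fills in ~2s
stall = timeout - time_to_fill             # ~8s accepting nothing

print(f"steady ~{steady_state / 1e6:.1f}MB, "
      f"fill in {time_to_fill:.1f}s, stall {stall:.1f}s")
# → steady ~2.0MB, fill in 2.0s, stall 8.0s
```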
This really is fundamentally opposed to the whole idea of Cassandra, and I
cannot emphasize enough how strongly I am against it, except as a literal last
resort when all other strategies have failed.
bq. Hinting is better than leaving things in an unknown state but it's not
something we should opt users into if we have a better option, since it
basically turns the write into CL.ANY.
I was under the impression we had moved on to talking about ACK'd writes. I'm
not suggesting we ack the handler with success.
What we do with unack'd writes is actually less important, and we have much
freer rein with them. We could throw OverloadedException (OE). We could block,
as you suggest, since these should be more evenly distributed.
However, I would prefer we do both, i.e., when we run out of room in the
coordinator, we should look to see if there are any nodes that have well in
excess of their fair share of entries waiting for a response. Let's call the
set of such nodes N:
# if N is empty, we block consumption of new writes, as you propose
# otherwise, we first evict those handlers that have been ACK'd to the client and can be safely hinted (and hint them)
# if this isn't enough, we evict handlers that, if all the nodes in N were to fail, would break the CL we are waiting on, and we throw OE
Step 3 is necessary both for CL.ALL and for the scenario where two failing
nodes have significant overlap.
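A minimal sketch of the three steps, under stated assumptions: {{Handler}}, {{FAIR_SHARE_FACTOR}}, and the callbacks are hypothetical illustrations of the policy, not Cassandra's actual internals.

```python
from dataclasses import dataclass, field

# Hypothetical threshold for "well in excess of fair share" (assumption).
FAIR_SHARE_FACTOR = 2.0

@dataclass
class Handler:
    cl_required: int                  # acks still needed to satisfy the CL
    acked_to_client: bool             # response already sent to the client?
    pending: set = field(default_factory=set)  # replicas yet to respond

def on_coordinator_full(handlers, block_new_writes, hint, throw_oe):
    # Count pending entries per node; N = nodes well over their fair share.
    counts = {}
    for h in handlers:
        for node in h.pending:
            counts[node] = counts.get(node, 0) + 1
    fair = sum(counts.values()) / max(len(counts), 1)
    N = {n for n, c in counts.items() if c > FAIR_SHARE_FACTOR * fair}

    if not N:
        block_new_writes()            # step 1: no outlier, stop consuming
        return
    for h in list(handlers):          # step 2: hint what is safely hintable
        if h.acked_to_client and h.pending & N:
            handlers.remove(h)
            hint(h)
    for h in list(handlers):          # step 3: evict handlers whose CL would
        if len(h.pending - N) < h.cl_required:  # break if all of N failed
            handlers.remove(h)
            throw_oe(h)
```

With one node hoarding most pending entries, step 2 frees the ACK'd handlers waiting on it, step 3 expires only those whose CL genuinely cannot be met, and everything else keeps flowing.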
> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Fix For: 2.1.x, 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding
> bytes and requests and if it reaches a high watermark disable read on client
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't
> introduce other issues.
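The watermark scheme in the issue description could be sketched as follows; the class and byte counts are illustrative assumptions, not Cassandra's actual code, and the real change would toggle reads on the native-protocol connections.

```python
import threading

class InflightLimiter:
    """Hypothetical high/low watermark gate over outstanding request bytes."""

    def __init__(self, high_bytes, low_bytes):
        self.high, self.low = high_bytes, low_bytes
        self.outstanding = 0          # bytes of in-flight requests
        self.reads_enabled = True     # whether client sockets are being read
        self.lock = threading.Lock()

    def on_request(self, nbytes):
        with self.lock:
            self.outstanding += nbytes
            if self.reads_enabled and self.outstanding >= self.high:
                self.reads_enabled = False   # stop reading client connections

    def on_complete(self, nbytes):
        with self.lock:
            self.outstanding -= nbytes
            if not self.reads_enabled and self.outstanding <= self.low:
                self.reads_enabled = True    # resume once below low watermark
```

The gap between the two watermarks provides hysteresis, so reads are not toggled on and off on every request boundary.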
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)