[
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376955#comment-15376955
]
Ariel Weisberg commented on CASSANDRA-9318:
-------------------------------------------
There really isn't much memory to play with when deciding when to backpressure.
There are 128 request threads, and once those are all consumed by a slow node,
which doesn't take long in a small cluster, things stall completely. If things
were async you might be able to commit enough memory that requests time out
before you need to stall. In other words, you could shed load to slow nodes via
timeouts, and no additional mechanism would be needed.
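If things were async, shedding by timeout might look something like the sketch
below. This is a hypothetical illustration, not Cassandra code: queued requests
carry their enqueue time, and the dispatch loop drops anything that has already
exceeded the client timeout instead of stalling on it.
{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of timeout-based shedding in an async coordinator.
final class TimeoutShedder
{
    static final long TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(10);

    static final class Request
    {
        final long enqueuedAtNanos = System.nanoTime();
        // payload omitted
    }

    final ConcurrentLinkedQueue<Request> pending = new ConcurrentLinkedQueue<>();

    void dispatchLoop()
    {
        Request r;
        while ((r = pending.poll()) != null)
        {
            // The client has already given up on this request; shed it
            // instead of tying up memory or a thread.
            if (System.nanoTime() - r.enqueuedAtNanos > TIMEOUT_NANOS)
                continue;
            send(r); // hand off to the async messaging layer
        }
    }

    void send(Request r) { /* omitted */ }
}
{code}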
Not reading from clients doesn't address the issue. You have still created a
situation in which requests for nodes that are performing well can't make
progress, because the coordinator can no longer read requests from clients on
account of one slow node. Not reading from clients is the current
implementation.
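For context, the native protocol server runs on Netty, where "not reading from
clients" amounts to toggling autoRead on the client channels so the event loop
stops draining their sockets and TCP backpressure builds up toward the client.
A minimal illustration, not the actual code path:
{code:java}
import io.netty.channel.Channel;

// Illustrative only: pausing and resuming reads on a client connection.
final class ReadToggle
{
    static void pauseReads(Channel ch)
    {
        ch.config().setAutoRead(false); // event loop stops reading; kernel buffers fill
    }

    static void resumeReads(Channel ch)
    {
        ch.config().setAutoRead(true);
    }
}
{code}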
Hinting as it works now doesn't address the issue, because the slow node may
never actually catch up or become faster. Waiting for every request that is
going to time out to actually time out and be hinted restricts the
coordinator's ability to coordinate. Hinting also doesn't work because there
are only 128 concurrent requests that can be in the process of being hinted;
see the first paragraph.
If the coordinator wants to continue to make progress it has to read requests
from clients and then quickly decide whether to shed them. We could shed them
silently, in which case the upstream client is going to time out and exhaust
its memory or thread pool, and we will have silently and unfixably moved the
problem upstream. I suppose clients could try to implement their own health
metrics, duplicating the work we are doing at the coordinator, but a client
still can't force the coordinator to shed so that it can replace requests that
won't succeed with ones that will.
Or we can signal that we aren't going to serve that request at this time, and
the client can engage whatever mitigation strategy it wants to implement. There
is a whole separate discussion about what the state of the art needs to be in
client drivers to do something useful with this information, and how to expose
the mechanism and policy choices to applications.
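As a sketch of that explicit signaling (the class and names here are
hypothetical, not Cassandra's): the coordinator admits work against a bound
and, when the bound is exceeded, replies immediately with an overload error
instead of dropping silently.
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical admission control at the coordinator. On rejection the
// caller should send the client an explicit "overloaded" error response.
final class AdmissionControl
{
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int maxInFlight;

    AdmissionControl(int maxInFlight)
    {
        this.maxInFlight = maxInFlight;
    }

    // Returns true if admitted; the caller must call release() when the
    // request completes. On false, signal overload to the client.
    boolean tryAdmit()
    {
        if (inFlight.incrementAndGet() > maxInFlight)
        {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    void release()
    {
        inFlight.decrementAndGet();
    }
}
{code}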
Rate limiting isn't really useful. You just end up with all the request threads
stuck in the rate limiter, and coordinators continue to make no progress. Rate
limiting also doesn't solve a load issue at the remote end because, as I've
demonstrated, the remote end can buffer up requests until shedding kicks in via
the timeout and reduces memory utilization to something the heap can handle.
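To make the first point concrete, here is the blocking failure mode sketched
with Guava's RateLimiter standing in for any blocking limiter: every request
thread parks in acquire(), so the coordinator can't pick up new work even for
healthy replicas.
{code:java}
import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of request threads piling up inside a blocking rate limiter.
final class RateLimitedStall
{
    public static void main(String[] args)
    {
        RateLimiter limiter = RateLimiter.create(10.0); // 10 permits/sec
        ExecutorService requestThreads = Executors.newFixedThreadPool(128);
        for (int i = 0; i < 10_000; i++)
        {
            requestThreads.execute(() -> {
                limiter.acquire(); // all 128 threads end up blocked here
                coordinate();      // throughput is capped regardless of node health
            });
        }
        requestThreads.shutdown();
    }

    static void coordinate() { /* omitted */ }
}
{code}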
If things were async, what would rate limiting look like? Would it be disabling
read for clients? How is the coordinator going to make progress then, if it
can't coordinate requests for healthy nodes?
> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
> Issue Type: Improvement
> Components: Local Write-Read Paths, Streaming and Messaging
> Reporter: Ariel Weisberg
> Assignee: Sergio Bossa
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png,
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding
> bytes and requests and if it reaches a high watermark disable read on client
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't
> introduce other issues.
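A hedged sketch of the watermark scheme described in the issue, assuming Netty
client channels; the class name and thresholds are illustrative, not an actual
implementation:
{code:java}
import io.netty.channel.Channel;
import io.netty.channel.group.ChannelGroup;
import io.netty.channel.group.DefaultChannelGroup;
import io.netty.util.concurrent.GlobalEventExecutor;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative: track in-flight request bytes, stop reading from all client
// channels above a high watermark, resume below a low watermark.
final class InFlightBytesLimiter
{
    private static final long HIGH_WATERMARK = 256L << 20; // 256 MiB, illustrative
    private static final long LOW_WATERMARK  = 192L << 20; // 192 MiB, illustrative

    private final AtomicLong inFlightBytes = new AtomicLong();
    private final ChannelGroup clients = new DefaultChannelGroup(GlobalEventExecutor.INSTANCE);

    void onRequestStart(long bytes)
    {
        if (inFlightBytes.addAndGet(bytes) > HIGH_WATERMARK)
            setAutoRead(false); // stop accepting new work from clients
    }

    void onRequestComplete(long bytes)
    {
        if (inFlightBytes.addAndGet(-bytes) < LOW_WATERMARK)
            setAutoRead(true); // drained enough; start reading again
    }

    private void setAutoRead(boolean enabled)
    {
        for (Channel ch : clients)
            ch.config().setAutoRead(enabled);
    }
}
{code}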
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)