[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534260#comment-14534260 ]

Benedict edited comment on CASSANDRA-9318 at 5/8/15 11:24 AM:
--------------------------------------------------------------

bq. In my mind in-flight means that until the response is sent back to the 
client the request counts against the in-flight limit.
bq. We already keep the original request around until we either get acks from 
all replicas, or they time out (and we write a hint).

Perhaps we need to spell out more clearly what this ticket is proposing. I 
understood it to mean what Ariel commented here, whereas it sounds like 
Jonathan is suggesting we simply prune our ExpiringMap based on bytes tracked 
as well as time?
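
For concreteness, pruning the callback map by bytes as well as time might look 
roughly like the sketch below. This is purely illustrative: the class, the byte 
accounting and the oldest-first eviction are stand-ins, not the actual 
ExpiringMap internals.

{code:java}
// Illustrative sketch only: prune the oldest outstanding callbacks once the
// bytes they represent exceed a cap, in addition to the usual time-based expiry.
// None of these names are the real ExpiringMap/callback APIs.
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

class ByteBoundedCallbacks
{
    static final class CallbackInfo
    {
        final long requestBytes;
        CallbackInfo(long requestBytes) { this.requestBytes = requestBytes; }
    }

    // keyed by message id; ids are allocated monotonically, so the lowest key
    // approximates the oldest outstanding request
    private final ConcurrentSkipListMap<Integer, CallbackInfo> callbacks = new ConcurrentSkipListMap<>();
    private final AtomicLong trackedBytes = new AtomicLong();
    private final long maxBytes;

    ByteBoundedCallbacks(long maxBytes)
    {
        this.maxBytes = maxBytes;
    }

    void register(int messageId, CallbackInfo cb)
    {
        callbacks.put(messageId, cb);
        if (trackedBytes.addAndGet(cb.requestBytes) > maxBytes)
            pruneUntilUnderCap();
    }

    // normal completion path: the response arrived (or the entry timed out)
    void remove(int messageId)
    {
        CallbackInfo cb = callbacks.remove(messageId);
        if (cb != null)
            trackedBytes.addAndGet(-cb.requestBytes);
    }

    // treat eviction exactly like an early timeout of the oldest requests
    private void pruneUntilUnderCap()
    {
        for (Integer id : callbacks.keySet())
        {
            if (trackedBytes.get() <= maxBytes)
                break;
            remove(id);
        }
    }
}
{code}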

The ExpiringMap requests are already "in-flight" and cannot be cancelled, so 
their effect on other nodes cannot be rescinded, and imposing a limit does not 
stop us from issuing more requests to the nodes in the cluster that are failing 
to keep up and respond to us. It _might_ be sensible, if we introduce more 
aggressive shedding at processing nodes, to also shed the response handlers more 
aggressively locally, but I'm not convinced it would have a significant impact 
on cluster health by itself; cluster instability spreads out from problematic 
nodes, and this scheme still permits us to inundate those nodes with requests 
they cannot keep up with.

Alternatively, the approach of forbidding new requests while you have items in 
the ExpiringMap causes the collapse of one node to spread throughout the 
cluster: rapidly (especially on small clusters) all requests in the system end 
up destined for the collapsing node, and every coordinator stops accepting 
requests. The system seizes up, and that node is still failing, since it has 
requests from the entire cluster queued up with it. In general the coordinator 
has no way of distinguishing between a node that has failed, a network 
partition, and a node that is merely struggling, so we don't know whether we 
should wait.

Some mix of the two might be possible: wait while a node is merely slow, then 
drop our response handlers for the node once it is marked as down. The latter 
may not be a bad thing to do anyway, but I would not want to depend on this 
behaviour to maintain our precious "A".
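
As a sketch of what that hybrid might look like, handlers could be reaped only 
once the failure detector marks their target down. The AliveCheck hook and the 
handler map below are illustrative stand-ins, not our actual internals.

{code:java}
// Illustrative sketch only: leave handlers for slow-but-alive nodes alone,
// but reap handlers whose target the failure detector has marked down, so
// they stop counting against any in-flight limit.
import java.net.InetAddress;
import java.util.Map;

class DownNodeHandlerReaper
{
    // stand-in for the failure detector
    interface AliveCheck { boolean isAlive(InetAddress endpoint); }

    // message id -> replica we are still waiting on (stand-in for the callback map)
    private final Map<Integer, InetAddress> awaitingResponse;
    private final AliveCheck aliveCheck;

    DownNodeHandlerReaper(Map<Integer, InetAddress> awaitingResponse, AliveCheck aliveCheck)
    {
        this.awaitingResponse = awaitingResponse;
        this.aliveCheck = aliveCheck;
    }

    // called periodically: drop only the handlers for nodes marked down;
    // a node that is merely struggling keeps its handlers until timeout, as today
    void reap()
    {
        awaitingResponse.entrySet().removeIf(e -> !aliveCheck.isAlive(e.getValue()));
    }
}
{code}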

It still seems the simplest and most robust solution is to make our work queues 
leaky, since this insulates the processing nodes from cluster-wide inundation in 
a way the coordinator approach cannot (even with the loss of "A" and the 
cessation of processing, that is a whole cluster pitted against potentially a 
single node; with hash partitioning it doesn't take long for all processing to 
begin involving a single failing node). We can do this on just the number of 
requests and still be much better off than we are currently.
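
By "leaky" I mean something along the lines of the sketch below, bounded purely 
by request count: once the queue is full, the oldest task is shed (and can be 
answered with a dropped/timed-out response) rather than letting the backlog grow 
without bound. The class is illustrative only, not a proposal for the exact 
implementation.

{code:java}
// Minimal sketch of a "leaky" work queue bounded by request count: when full,
// the oldest queued task is shed instead of letting the backlog grow without bound.
import java.util.ArrayDeque;

class LeakyQueue<T>
{
    private final ArrayDeque<T> queue = new ArrayDeque<>();
    private final int maxSize;

    LeakyQueue(int maxSize)
    {
        this.maxSize = maxSize;
    }

    // returns the task that was shed to make room, or null if nothing was shed;
    // the caller can then reply with a dropped/timed-out response for it
    synchronized T offer(T task)
    {
        T shed = queue.size() >= maxSize ? queue.pollFirst() : null;
        queue.addLast(task);
        return shed;
    }

    synchronized T poll()
    {
        return queue.pollFirst();
    }
}
{code}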

We could also pair this with coordinator-level dropping of handlers for "down" 
nodes, and above a size/count threshold. The latter, though, could result in 
widespread uncoordinated dropping of requests, which may leave us open to a 
multiplying effect of cluster overload, with each node dropping different 
requests, possibly leading to only a tiny fraction of requests being serviced 
to their required CL across the cluster. I'm not sure how we can best model 
this risk, or how to avoid it without notifying coordinators of the drop of a 
message, and I don't see that being delivered for 2.1.


> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
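
For reference, the high/low watermark scheme described in the summary above 
might be sketched roughly as follows. The ReadToggle hook is a stand-in for 
whatever the native transport exposes (e.g. toggling autoRead on the client 
channels), and the accounting is deliberately simplified.

{code:java}
// Sketch of the high/low watermark idea: track outstanding request bytes and
// pause reading from client connections above the high watermark, resuming
// once we drain back below the low watermark. ReadToggle is a stand-in for
// whatever the transport exposes; races around the pause flag are ignored
// for brevity.
import java.util.concurrent.atomic.AtomicLong;

class InflightLimiter
{
    interface ReadToggle { void setReadEnabled(boolean enabled); }

    private final AtomicLong inflightBytes = new AtomicLong();
    private final long highWatermark;
    private final long lowWatermark;
    private final ReadToggle clientConnections;
    private volatile boolean paused;

    InflightLimiter(long highWatermark, long lowWatermark, ReadToggle clientConnections)
    {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
        this.clientConnections = clientConnections;
    }

    void onRequestReceived(long bytes)
    {
        if (inflightBytes.addAndGet(bytes) > highWatermark && !paused)
        {
            paused = true;
            clientConnections.setReadEnabled(false);
        }
    }

    // called once the response has been sent back to the client
    void onRequestCompleted(long bytes)
    {
        if (inflightBytes.addAndGet(-bytes) < lowWatermark && paused)
        {
            paused = false;
            clientConnections.setReadEnabled(true);
        }
    }
}
{code}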


