[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

Jonathan Ellis (JIRA) Fri, 08 May 2015 12:41:38 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535356#comment-14535356 ]


Jonathan Ellis edited comment on CASSANDRA-9318 at 5/8/15 7:40 PM:
-------------------------------------------------------------------

bq. it sounds like Jonathan is suggesting we simply prune our ExpiringMap based 
on bytes tracked as well as time?

No, I'm suggesting we abort requests more aggressively with OverloadedException 
(OE) *before sending them to replicas*.  One place this might make sense is 
sendToHintedEndpoints, where we already throw OE.

Right now we only throw OE once we start writing hints for a node that is in 
trouble.  This doesn't seem to be aggressive enough.  (Although most of our 
users are on 2.0, where we allowed 8x as many hints in flight before starting 
to throttle.)

So, I am suggesting we also track outstanding requests (perhaps with the 
ExpiringMap, as you suggest) and stop accepting new requests once we hit a 
reasonable limit of "you can't possibly process more requests than this in 
parallel."
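
To make the idea concrete, here is a minimal Java sketch of a coordinator-side 
in-flight counter that rejects new requests once a fixed limit is reached.  It 
is an illustration only, not Cassandra code: InFlightLimiter, 
OverloadedSketchException (a stand-in for OverloadedException), and the 
maxInFlight parameter are all hypothetical names, and it counts requests 
rather than bytes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for Cassandra's OverloadedException, defined here
// only so that the sketch is self-contained.
class OverloadedSketchException extends RuntimeException {
    OverloadedSketchException(String message) {
        super(message);
    }
}

// Hypothetical counter of requests the coordinator has accepted but not yet
// completed. Not part of Cassandra.
class InFlightLimiter {
    private final int maxInFlight;
    private final AtomicInteger inFlight = new AtomicInteger();

    InFlightLimiter(int maxInFlight) {
        if (maxInFlight <= 0)
            throw new IllegalArgumentException("maxInFlight must be positive");
        this.maxInFlight = maxInFlight;
    }

    // Call before sending anything to replicas. Throws instead of buffering
    // once the limit has been reached.
    void acquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxInFlight)
                throw new OverloadedSketchException(
                    "too many in-flight requests: " + current);
            // Retry if another thread changed the counter in the meantime.
            if (inFlight.compareAndSet(current, current + 1))
                return;
        }
    }

    // Call exactly once when an accepted request completes or times out.
    void release() {
        inFlight.decrementAndGet();
    }

    int current() {
        return inFlight.get();
    }
}
```

A caller would invoke acquire() before dispatching to replicas and release() 
in the completion or timeout path; how that maps onto ExpiringMap entries is 
left open here.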

bq. The ExpiringMap requests are already "in-flight" and cannot be cancelled, 
so their effect on other nodes cannot be rescinded, and imposing a limit does 
not stop us issuing more requests to the nodes in the cluster that are failing 
to keep up and respond to us.

Right, and I'm fine with that.  The goal is not to keep the replica completely 
out of trouble.  The goal is to keep the coordinator from falling over from 
buffering ExpiringMap and MessagingService entries that it can't drain fast 
enough.  Secondarily, this will help the replica too: our existing load 
shedding is fine for recovering from temporary spikes in load, but it isn't 
good enough to save a replica when the coordinators keep throwing more at it 
while it's already overwhelmed.


> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
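
The high/low watermark scheme in the description above can be sketched as 
follows.  This is a minimal Java illustration under stated assumptions, not 
the eventual implementation: WatermarkGate and its method names are 
hypothetical, it tracks bytes only (not request counts), and actually pausing 
reads on a client connection is left to the caller.

```java
// Hypothetical tracker of outstanding request bytes with hysteresis.
// Reads are paused at the high watermark and resumed only after the total
// drops below the low watermark. Not part of Cassandra.
class WatermarkGate {
    private final long lowBytes;
    private final long highBytes;
    private long outstandingBytes;
    private boolean readsPaused;

    WatermarkGate(long lowBytes, long highBytes) {
        if (lowBytes < 0 || lowBytes >= highBytes)
            throw new IllegalArgumentException("need 0 <= low < high");
        this.lowBytes = lowBytes;
        this.highBytes = highBytes;
    }

    // Record an accepted request; returns true if client reads should now
    // be paused.
    synchronized boolean onRequestStart(long bytes) {
        outstandingBytes += bytes;
        if (outstandingBytes >= highBytes)
            readsPaused = true;
        return readsPaused;
    }

    // Record a completed request; returns true if client reads should stay
    // paused.
    synchronized boolean onRequestEnd(long bytes) {
        outstandingBytes -= bytes;
        if (outstandingBytes < lowBytes)
            readsPaused = false;
        return readsPaused;
    }

    synchronized boolean readsPaused() {
        return readsPaused;
    }
}
```

The gap between the two watermarks keeps the gate from flapping when the 
outstanding total hovers near a single threshold.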



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
