[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374158#comment-15374158 ]

Stefania commented on CASSANDRA-9318:
-------------------------------------

bq. Right, but there isn't much we can do without way more invasive changes. 
Anyway, I don't think that's actually a problem, as if the coordinator is 
overloaded we'll end up generating too many hints and fail with 
OverloadedException (this time with its original meaning), so we should be 
covered.

I tend to agree that it is an approximation we can live with; I would also 
rather not change the lower levels of the messaging service for this.
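
For reference, the coordinator-side protection mentioned above works roughly as 
in the sketch below (a minimal sketch with illustrative names and thresholds, 
not the actual {{StorageProxy}} code): once too many hints are in flight, 
further writes are rejected with an {{OverloadedException}}.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: the coordinator refuses further writes once the
// number of in-flight hints crosses a cap, instead of queueing hints without bound.
final class HintOverloadGuard
{
    // Illustrative cap; in practice the limit is configurable on the coordinator.
    private static final long MAX_HINTS_IN_PROGRESS = 1024;

    private final AtomicLong hintsInProgress = new AtomicLong();

    void beforeHinting(String replica)
    {
        if (hintsInProgress.get() > MAX_HINTS_IN_PROGRESS)
            throw new OverloadedException("Too many in-flight hints, rejecting write for " + replica);
        hintsInProgress.incrementAndGet();
    }

    void afterHintWritten()
    {
        hintsInProgress.decrementAndGet();
    }

    static final class OverloadedException extends RuntimeException
    {
        OverloadedException(String message) { super(message); }
    }
}
{code}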

bq. Does it mean we should advance the protocol version in this issue, or 
delegate to a new issue?

We have a number of issues waiting for protocol V5; they are labeled 
{{protocolv5}}. Either we make this issue dependent on V5 as well or, since we 
are committing this as disabled, we delegate to a new issue that depends on V5.

bq. Do you see any complexity I'm missing there?

A new flag would involve a new version, and it would need to be handled during 
rolling upgrades. Even if on its own it is not too complex, the system in its 
entirety becomes even more complex (different versions, compression, cross-node 
timeouts, some verbs are droppable and others aren't, and the list goes on). 
Unless it solves a problem, I don't think we should consider it, and we are 
saying in other parts of this conversation that hints are no longer a problem.

bq. as the advantage would be increased consistency at the expense of more 
resource consumption, 

IMO we don't increase consistency if the client has already been told the 
mutation failed. If we are instead referring to replicas that were outside the 
CL pool and temporarily overloaded, I think they are better off dropping 
mutations and handling them later on through hints. Basically, I see dropping 
mutations replica-side as a self-defense mechanism for replicas; I don't think 
we should remove it. Rather, we should focus on a backpressure strategy such 
that replicas don't need to drop mutations. Also, for the time being, I'd 
rather focus on the major issue, which is that we haven't reached consensus on 
how to apply backpressure yet, and propose this new idea in a follow-up ticket 
if backpressure is successful.
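
To make the replica-side self-defense concrete, here is a minimal sketch 
(illustrative names and timeout, not the actual messaging code): a droppable 
mutation that has already waited longer than the write timeout is dropped 
rather than applied, and the replica catches up later through hints.

{code:java}
// Illustrative sketch only: a mutation that sat in the queue longer than the
// write timeout is dropped, since the coordinator has already timed it out and
// hinted handoff will repair this replica later.
final class DroppableMutationTask
{
    private static final long WRITE_RPC_TIMEOUT_MILLIS = 2000; // illustrative default

    private final long enqueuedAtMillis;
    private final Runnable applyMutation;

    DroppableMutationTask(long enqueuedAtMillis, Runnable applyMutation)
    {
        this.enqueuedAtMillis = enqueuedAtMillis;
        this.applyMutation = applyMutation;
    }

    /** @return true if the mutation was applied, false if it was dropped as stale. */
    boolean run()
    {
        long queuedFor = System.currentTimeMillis() - enqueuedAtMillis;
        if (queuedFor > WRITE_RPC_TIMEOUT_MILLIS)
            return false; // drop: applying it now only adds load on an overloaded replica
        applyMutation.run();
        return true;
    }
}
{code}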

bq. These are valid concerns of course, and given similar concerns from 
Jonathan Ellis, I'm working on some changes to avoid write timeouts due to 
healthy replicas unnaturally throttled by unhealthy ones, and depending on 
Jonathan Ellis' answer to my last comment above, maybe only actually 
back-pressure if the CL is not met.

OK, so we are basically trying to address the three scenarios by 
throttling/failing only if the system as a whole cannot handle the mutations 
(that is, at least CL replicas are slow/overloaded), whereas if fewer than CL 
replicas are slow/overloaded, those replicas get hinted?
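
If I understood the proposal correctly, the decision would look roughly like 
this sketch (purely illustrative names, not the patch under review): only apply 
backpressure when the overloaded replicas leave fewer than CL healthy ones, and 
otherwise hint the slow replicas and proceed.

{code:java}
// Illustrative sketch of a CL-aware backpressure decision.
enum WriteAction { PROCEED_AND_HINT_SLOW, APPLY_BACKPRESSURE }

final class ClAwareBackpressure
{
    static WriteAction decide(int totalReplicas, int overloadedReplicas, int requiredForCl)
    {
        int healthyReplicas = totalReplicas - overloadedReplicas;

        // If the healthy replicas alone can still satisfy the CL, don't slow the
        // client down: hint the overloaded replicas and move on.
        if (healthyReplicas >= requiredForCl)
            return WriteAction.PROCEED_AND_HINT_SLOW;

        // Otherwise the system as a whole cannot absorb the write at this CL.
        return WriteAction.APPLY_BACKPRESSURE;
    }
}
{code}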

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and, if it reaches a high watermark, disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
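
The watermark scheme described in the issue could be sketched as follows 
(illustrative names and thresholds; the {{Channel}} handle and {{setAutoRead}} 
are Netty's, which the native protocol server is built on): stop reading from 
client connections once in-flight bytes cross a high watermark, and resume once 
they fall back below a low watermark.

{code:java}
import io.netty.channel.Channel;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: bound in-flight request bytes with high/low watermarks
// by toggling whether we keep reading from client connections.
final class InFlightRequestLimiter
{
    // Illustrative thresholds, not actual Cassandra configuration values.
    private static final long HIGH_WATERMARK_BYTES = 64L * 1024 * 1024;
    private static final long LOW_WATERMARK_BYTES  = 48L * 1024 * 1024;

    private final AtomicLong inFlightBytes = new AtomicLong();

    void onRequestReceived(Channel channel, long requestBytes)
    {
        if (inFlightBytes.addAndGet(requestBytes) > HIGH_WATERMARK_BYTES)
            channel.config().setAutoRead(false);  // stop reading new requests from this client
    }

    void onRequestCompleted(Channel channel, long requestBytes)
    {
        if (inFlightBytes.addAndGet(-requestBytes) < LOW_WATERMARK_BYTES)
            channel.config().setAutoRead(true);   // resume once back below the low watermark
    }
}
{code}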



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
