[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370159#comment-15370159 ]

Stefania commented on CASSANDRA-9318:
-------------------------------------

bq. I've pushed a few more commits to address your concerns.

The new commits look good: the code is clearly cleaner with an abstract 
BackPressureState created by the strategy, and with the back-pressure window 
equal to the write timeout (a rough sketch of that split follows the list 
below). A couple of things are left to do:

* Rebase on trunk
* Get rid of the trailing spaces
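
Coming back to the split above, this is roughly the shape I understand, as a 
purely illustrative sketch: names and signatures are mine, not necessarily the 
patch's.

{code}
import java.net.InetAddress;

// Illustrative sketch only, not the patch's code: the strategy owns the
// window (here sized on the write timeout) and hands out one state object
// per replica; the send/response/timeout paths feed that state.
interface BackPressureState
{
    void onMessageSent();
    void onResponseReceived();
    void onResponseTimeout();
}

interface BackPressureStrategy
{
    /** One state instance per replica, with the window equal to the write timeout. */
    BackPressureState newState(InetAddress replica, long windowMillis);

    /** Throttle or delay outgoing writes based on the collected states. */
    void apply(Iterable<BackPressureState> states);
}
{code}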

bq.  more specifically, the rates are now tracked together when either a 
response is received or the callback is expired,

I have some concerns with this:

* Do we track the case where we receive a failed response? Specifically, in 
ResponseVerbHandler.doVerb, shouldn't we call updateBackPressureState() when 
the message is a failure response as well?
* If we receive a response after it has timed out, won't we count that request 
twice, incorrectly inflating the rate for that window? (See the sketch below 
for both points.)
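
To make both points concrete, here is a small self-contained sketch (names are 
hypothetical, reusing the interfaces sketched earlier, not the patch's code) of 
the behaviour I would expect:

{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a single flag per in-flight request ensures it is
// counted exactly once, whether the callback expires first or a (successful
// or failed) response arrives first.
final class RequestOutcome
{
    private final AtomicBoolean counted = new AtomicBoolean(false);

    /** Called from the callback-expiration path. */
    void onCallbackExpired(BackPressureState state)
    {
        if (counted.compareAndSet(false, true))
            state.onResponseTimeout();
    }

    /** Called from ResponseVerbHandler.doVerb for both success and failure responses. */
    void onResponse(BackPressureState state)
    {
        // A response arriving after expiration must not be counted again,
        // otherwise the rate for that window is inflated.
        if (counted.compareAndSet(false, true))
            state.onResponseReceived();
    }
}
{code}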

bq. Configuration-wise, we're now left with only the back_pressure_enabled 
boolean and the back_pressure_strategy, and I'd really like to keep the former, 
as it makes way easier to dynamically turn the back-pressure on/off.

It's much better now, and I am not strongly opposed to keeping 
back_pressure_enabled, but I would also argue that it is quite easy to comment 
out the strategy, or to have an empty strategy in the code that means no 
back-pressure.
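
For instance, an "empty" strategy along these lines (illustrative only, 
reusing the interfaces sketched above) would be the moral equivalent of 
back_pressure_enabled: false:

{code}
import java.net.InetAddress;

// Illustrative only: a strategy that never throttles, i.e. configuring it
// behaves exactly like disabling back-pressure altogether.
final class NoOpBackPressureStrategy implements BackPressureStrategy
{
    public BackPressureState newState(InetAddress replica, long windowMillis)
    {
        return new BackPressureState()
        {
            public void onMessageSent() {}
            public void onResponseReceived() {}
            public void onResponseTimeout() {}
        };
    }

    public void apply(Iterable<BackPressureState> states)
    {
        // Intentionally does nothing: no back-pressure is applied.
    }
}
{code}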

bq. Talking about the overloaded state and the usage of OverloadedException, I 
agree the latter might be misleading, and I agree some failure conditions could 
lead to requests being wrongly refused, but I'd also like to keep some form of 
"emergency" feedback towards the client: what about throwing OE only if all (or 
a given number depending on the CL?) replicas are overloaded?

I think what we may need is a new companion snitch that sorts the replicas by 
backpressure ratio, or an enhanced dynamic snitch that is aware of the 
backpressure strategy. If we sort replicas correctly, then we can throw an 
exception to the client if one of the selected replicas is overloaded. 
However, I think the exception needs to be different. native_protocol_v4.spec 
clearly states:

{code}
0x1001    Overloaded: the request cannot be processed because the coordinator 
node is overloaded
{code}

Sorry, I totally missed the meaning of this exception in the previous round of 
review.
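
To sketch the sorting idea (purely illustrative; the ratio lookup and the 
threshold are assumptions, not proposed APIs):

{code}
import java.net.InetAddress;
import java.util.Comparator;
import java.util.List;

// Purely illustrative: order replicas by a per-host back-pressure ratio
// (0.0 = idle, 1.0 = fully throttled) before the coordinator picks its
// targets, and only refuse the request if a replica it actually has to
// contact is above some overload threshold.
final class BackPressureAwareSorting
{
    interface RatioSource
    {
        double backPressureRatio(InetAddress replica); // hypothetical lookup
    }

    static void sortByLeastPressured(List<InetAddress> replicas, RatioSource ratios)
    {
        replicas.sort(Comparator.comparingDouble(ratios::backPressureRatio));
    }

    static boolean shouldRefuse(List<InetAddress> selected, RatioSource ratios, double threshold)
    {
        return selected.stream().anyMatch(r -> ratios.backPressureRatio(r) >= threshold);
    }
}
{code}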

bq. Finally, one more wild idea to consider: given this patch greatly reduces 
the number of dropped mutations, and hence the number of inflight hints, what 
do you think about disabling load shedding by the replica side when 
back-pressure is enabled? This way we'd trade "full consistency" for a 
hopefully smaller number of unnecessary hints sent over to "pressured" replicas 
when their callbacks expire by the coordinator side.

By "load shedding by the replica" do we mean dropping mutations that have timed 
out or something else?
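
For context, by "load shedding" I am assuming something along these lines (a 
paraphrase, not the actual code path):

{code}
// Paraphrased illustration of "load shedding on the replica side": a mutation
// that has already been in flight longer than the write timeout is dropped
// instead of applied, since the coordinator has given up on it anyway.
final class LoadShedding
{
    static boolean shouldDropMutation(long constructionTimeMillis, long nowMillis, long writeRpcTimeoutMillis)
    {
        return (nowMillis - constructionTimeMillis) > writeRpcTimeoutMillis;
    }
}
{code}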

Regardless, there is the problem of ensuring that all nodes have backpressure 
enabled, which may not be trivial. I think replicas should have a mechanism for 
protecting themselves rather than relying on the coordinator, but I can comment 
further once we clarify exactly what we mean.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
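
A minimal sketch of the high/low watermark gating mentioned in the description 
above; all names and thresholds are assumptions, not an actual implementation:

{code}
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the watermark idea: stop reading from client connections once
// in-flight request bytes exceed the high watermark, and resume only once
// they fall back below the low watermark.
final class InFlightRequestGate
{
    private final AtomicLong inFlightBytes = new AtomicLong();
    private final long highWatermark;
    private final long lowWatermark;
    private volatile boolean readPaused;

    InFlightRequestGate(long highWatermark, long lowWatermark)
    {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    void onRequestStarted(long bytes)
    {
        if (inFlightBytes.addAndGet(bytes) > highWatermark)
            readPaused = true;   // caller would disable read on client connections
    }

    void onRequestCompleted(long bytes)
    {
        if (inFlightBytes.addAndGet(-bytes) < lowWatermark)
            readPaused = false;  // caller would re-enable read
    }

    boolean isReadPaused()
    {
        return readPaused;
    }
}
{code}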


