[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373624#comment-15373624 ]

Sergio Bossa edited comment on CASSANDRA-9318 at 7/12/16 8:23 PM:
------------------------------------------------------------------

[~Stefania],

bq. if a message expires before it is sent, we consider this negatively for 
that replica, since we increment the outgoing rate but not the incoming rate 
when the callback expires, and still it may have nothing to do with the 
replica: if the message was not sent, it may be due to the coordinator dealing 
with too many messages.

Right, but there isn't much we can do without way more invasive changes. 
Anyway, I don't think that's actually a problem, as if the coordinator is 
overloaded we'll end up generating too many hints and fail with 
{{OverloadedException}} (this time with its original meaning), so we should be 
covered.
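
(For the record, the kind of coordinator-side guard I mean is roughly the 
following; the class, counter and threshold names are made up for illustration, 
this is not the actual {{StorageProxy}} code.)

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: reject writes when the coordinator itself has too
// many hints in flight, which is the "original meaning" of OverloadedException.
public final class HintOverloadGuard
{
    private final int maxHintsInFlight;                         // assumed threshold
    private final AtomicInteger hintsInFlight = new AtomicInteger();

    public HintOverloadGuard(int maxHintsInFlight)
    {
        this.maxHintsInFlight = maxHintsInFlight;
    }

    // Called when the coordinator decides to hint a mutation; throws if the
    // coordinator has accumulated too many hints, i.e. it is overloaded itself.
    public void onHintEnqueued()
    {
        if (hintsInFlight.incrementAndGet() > maxHintsInFlight)
        {
            hintsInFlight.decrementAndGet();
            // Stands in for surfacing OverloadedException to the client.
            throw new IllegalStateException("Too many in-flight hints, coordinator overloaded");
        }
    }

    // Called once the hint has been durably written (or replayed).
    public void onHintStored()
    {
        hintsInFlight.decrementAndGet();
    }
}
{code}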

bq. I also observe that if a replica has a low rate, then we may block when 
acquiring the limiter, and this will indirectly throttle all following 
replicas, even if they were ready to receive mutations sooner.

See my answer at the end.
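
(To make that concern concrete for anyone following along, here is a toy, 
self-contained illustration using a plain Guava {{RateLimiter}} and made-up 
replica names: when per-replica limiters are acquired sequentially, a single 
low-rate replica delays the sends to every replica after it.)

{code:java}
import com.google.common.util.concurrent.RateLimiter;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy demo, not the patch: one slow limiter delays all later replicas.
public class SequentialThrottlingDemo
{
    public static void main(String[] args)
    {
        Map<String, RateLimiter> limiters = new LinkedHashMap<>();
        limiters.put("replica-1", RateLimiter.create(0.5));   // unhealthy: one permit every 2s
        limiters.put("replica-2", RateLimiter.create(1000));  // healthy
        limiters.put("replica-3", RateLimiter.create(1000));  // healthy

        for (int mutation = 1; mutation <= 2; mutation++)
        {
            for (Map.Entry<String, RateLimiter> replica : limiters.entrySet())
            {
                // On the second mutation this blocks ~2s on replica-1's limiter,
                // so replica-2 and replica-3 also receive it ~2s later.
                double waited = replica.getValue().acquire();
                System.out.printf("mutation %d sent to %s after waiting %.2fs%n",
                                  mutation, replica.getKey(), waited);
            }
        }
    }
}
{code}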

bq. AbstractWriteResponseHandler sets the start time in the constructor, so the 
time spent acquiring a rate limiter for slow replicas counts towards the total 
time before the coordinator throws a write timeout exception.

See my answer at the end.

bq. SP.sendToHintedEndpoints(), we should apply backpressure only if the 
destination is alive.

I know, I'm holding off on these changes until we settle on a plan for the 
whole write path (in terms of what to do with the CL, the exception to be 
thrown, etc.).

bq. Let's use UnavailableException since WriteFailureException indicates a 
non-timeout failure when processing a mutation, and so it is not appropriate 
for this case. For protocol V4 we cannot change UnavailableException, but for 
V5 we should add a new parameter to it. At the moment it contains 
<cl><required><alive>, we should add the number of overloaded replicas, so that 
drivers can treat the two cases differently.

Does that mean we should advance the protocol version in this issue, or 
delegate it to a new issue?
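
(For V5, the shape I'd expect is something like the following; this is purely 
illustrative of the extra parameter, with made-up names, not an actual protocol 
change.)

{code:java}
// Hypothetical V5-style payload: <cl><required><alive> plus the number of
// overloaded replicas, so drivers can tell "not enough live replicas" apart
// from "replicas shedding load under back-pressure".
public class UnavailableWithOverloadInfo extends RuntimeException
{
    public final String consistency;   // <cl>
    public final int required;         // <required>
    public final int alive;            // <alive>
    public final int overloaded;       // proposed new parameter

    public UnavailableWithOverloadInfo(String consistency, int required, int alive, int overloaded)
    {
        super(String.format("Cannot achieve consistency level %s (required %d, alive %d, overloaded %d)",
                            consistency, required, alive, overloaded));
        this.consistency = consistency;
        this.required = required;
        this.alive = alive;
        this.overloaded = overloaded;
    }
}
{code}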

bq. Marking messages as throttled would let the replica know if backpressure 
was enabled, that's true, but it also makes the existing mechanism even more 
complex.

How so? In implementation terms, it should be literally as easy as:
1) Add a byte parameter to {{MessageOut}}.
2) Read that byte parameter from {{MessageIn}} and, if set, skip dropping the 
mutation replica-side.
3) If possible (I didn't check), when a "late" response is received on the 
coordinator, try to cancel the related hint.

Do you see any complexity I'm missing there?
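
To be explicit, a rough sketch of steps 1 and 2 (the parameter key and helper 
are hypothetical, not the actual {{MessageOut}}/{{MessageIn}} plumbing):

{code:java}
import java.util.Map;

// Hypothetical helper illustrating the throttled-message flag; the real change
// would live in the MessageOut/MessageIn parameter handling.
public final class ThrottleFlag
{
    public static final String PARAM_NAME = "THROTTLED";        // assumed parameter key
    public static final byte[] PARAM_VALUE = new byte[]{ 1 };

    // Step 1: coordinator side, tag the outgoing mutation as back-pressure throttled.
    public static void tag(Map<String, byte[]> outgoingParameters)
    {
        outgoingParameters.put(PARAM_NAME, PARAM_VALUE);
    }

    // Step 2: replica side, don't drop a throttled mutation on timeout, since
    // the delay was deliberately introduced by the coordinator.
    public static boolean shouldDrop(Map<String, byte[]> incomingParameters,
                                     long queuedMillis, long rpcTimeoutMillis)
    {
        boolean throttled = incomingParameters.containsKey(PARAM_NAME);
        return !throttled && queuedMillis > rpcTimeoutMillis;
    }
}
{code}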

bq. dropping mutations that have been in the queue for longer than the RPC 
write timeout is done not only to shed load on the replica, but also to avoid 
wasting resources to perform a mutation when the coordinator has already 
returned a timeout exception to the client.

This is very true, and that's why I said it's a bit of a wild idea. Obviously, 
that is true outside of back-pressure too, as even now it is possible to return 
a write timeout to clients and still have some or all mutations applied. In the 
end, it might be good to optionally enable such behaviour: the advantage would 
be increased consistency at the expense of more resource consumption, which is 
a tradeoff some users might want to make. To be clear, I'm not strictly 
lobbying to implement it, just trying to reason about the pros and cons.
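
(For readers, the replica-side load shedding being discussed boils down to the 
check below; the opt-in flag is hypothetical and only illustrates the behaviour 
I'm reasoning about.)

{code:java}
// Simplified sketch of the drop-on-expiry check, with a hypothetical opt-in to
// apply expired mutations anyway (more consistency, more resource consumption).
public final class MutationDropPolicy
{
    private final long rpcWriteTimeoutMillis;
    private final boolean applyExpiredMutations;   // hypothetical opt-in flag

    public MutationDropPolicy(long rpcWriteTimeoutMillis, boolean applyExpiredMutations)
    {
        this.rpcWriteTimeoutMillis = rpcWriteTimeoutMillis;
        this.applyExpiredMutations = applyExpiredMutations;
    }

    // Returns true if the replica should skip applying a mutation that has been
    // queued since enqueuedAtMillis.
    public boolean shouldDrop(long enqueuedAtMillis, long nowMillis)
    {
        boolean expired = nowMillis - enqueuedAtMillis > rpcWriteTimeoutMillis;
        // The coordinator has likely already returned a timeout to the client,
        // but applying anyway trades extra work for better consistency.
        return expired && !applyExpiredMutations;
    }
}
{code}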

bq. I still have concerns regarding additional write timeout exceptions and 
whether an overloaded or slow replica can slow everything down.

These are valid concerns of course, and given similar concerns from [~jbellis], 
I'm working on some changes to avoid write timeouts caused by healthy replicas 
being unnaturally throttled by unhealthy ones, and, depending on [~jbellis]'s 
answer to my last comment above, maybe only actually applying back-pressure 
when the CL is not met.

Stay tuned.


> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
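
A hedged sketch of the high/low watermark idea described above, using Netty's 
per-channel autoRead toggle; the thresholds, names and wiring are assumptions, 
not this ticket's patch:

{code:java}
import io.netty.channel.Channel;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the watermark idea: stop reading from client connections once too
// many request bytes are in flight, resume once back below the low watermark.
public final class InFlightWatermark
{
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();

    public InFlightWatermark(long highWatermarkBytes, long lowWatermarkBytes)
    {
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    public void onRequestStarted(Channel channel, long requestBytes)
    {
        if (inFlightBytes.addAndGet(requestBytes) > highWatermarkBytes)
            channel.config().setAutoRead(false);   // stop reading from this client socket
    }

    public void onRequestCompleted(Channel channel, long requestBytes)
    {
        if (inFlightBytes.addAndGet(-requestBytes) < lowWatermarkBytes)
            channel.config().setAutoRead(true);    // resume reads
    }
}
{code}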



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
