[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372114#comment-15372114 ]

Stefania commented on CASSANDRA-9318:
-------------------------------------

bq. But can that really happen? ResponseVerbHandler returns before incrementing 
back-pressure if the callback is null (i.e. expired), and OutboundTcpConnection 
doesn't even send outbound messages if they're timed out, or am I missing 
something?

You're correct, we won't count twice because the callback is already null. 
However, this raises another point: if a message expires before it is even 
sent, we count that against the replica, since we increment the outgoing rate 
but never the incoming rate once the callback expires. Yet the expiry may have 
nothing to do with the replica if the message was never sent; it may simply be 
the coordinator dealing with too many messages.
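
To illustrate the accounting (just a schematic sketch, the class and method 
names are mine and not the actual patch code), the ratio driving the limiter 
only ever sees "messages counted out" versus "responses counted in", so a 
message that dies in the coordinator's queue looks identical to one the 
replica dropped:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the actual back-pressure implementation: the ratio
// is lowered whenever an outgoing message is counted but no response ever
// arrives, so a message that expires in the coordinator's own queue before
// being sent penalizes the replica exactly as if the replica had dropped it.
class ReplicaBackPressureState
{
    private final AtomicLong outgoing = new AtomicLong();
    private final AtomicLong incoming = new AtomicLong();

    void onMessageEnqueued()  { outgoing.incrementAndGet(); } // counted when the coordinator enqueues the message
    void onResponseReceived() { incoming.incrementAndGet(); } // never called if the callback expires

    // Ratio used to size the replica's rate limiter: lower means "more overloaded".
    double ratio()
    {
        long out = outgoing.get();
        return out == 0 ? 1.0 : (double) incoming.get() / out;
    }
}
{code}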
 
bq. Again, I believe this would make enabling/disabling back-pressure via JMX 
less user friendly.

Fine, let's keep the boolean since it makes life easier for JMX.

bq. I do not think sorting replicas is what we really need, as you have to send 
the mutation to all replicas anyway. I think what you rather need is a way to 
pre-emptively fail if the write consistency level is not met by enough 
"non-overloaded" replicas, i.e.:

You're correct in that the replicas are not sorted in the write path, only in 
the read path; I confused the two yesterday. We should indeed fail only if the 
write consistency level cannot be met. I also observe that if a replica has a 
low rate, we may block when acquiring its limiter, which indirectly throttles 
all following replicas, even if they were ready to receive mutations sooner. 
Therefore, even a single overloaded or slow replica may slow down the entire 
write operation. Further, AbstractWriteResponseHandler sets the start time in 
its constructor, so the time spent acquiring a rate limiter for slow replicas 
counts towards the total time before the coordinator throws a write timeout 
exception. So, unless we increase the write RPC timeout or change the existing 
behavior, we may observe write timeout exceptions and, at CL.ANY, hints.
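
Roughly, the write-path behaviour I'm worried about looks like this (a 
simplified sketch, not the real StorageProxy code; only the Guava RateLimiter 
usage is real):

{code:java}
import com.google.common.util.concurrent.RateLimiter;

import java.net.InetAddress;
import java.util.Map;

// Illustrative only: mutations go out to replicas one after another, and each
// send first acquires that replica's rate limiter. A single replica whose
// limiter has a very low rate blocks here and delays every replica that
// follows it, while the wall-clock time spent waiting still counts against
// the write RPC timeout started in the response handler's constructor.
class SequentialSendSketch
{
    private final Map<InetAddress, RateLimiter> limiters;

    SequentialSendSketch(Map<InetAddress, RateLimiter> limiters)
    {
        this.limiters = limiters;
    }

    void sendToReplicas(Iterable<InetAddress> replicas)
    {
        for (InetAddress replica : replicas)
        {
            limiters.get(replica).acquire(); // may block for a long time if this replica is throttled hard
            sendMutation(replica);           // every later replica waits behind the acquire above
        }
    }

    private void sendMutation(InetAddress replica)
    {
        // stand-in for handing the mutation to the messaging service in the real write path
    }
}
{code}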

Also, in SP.sendToHintedEndpoints(), we should apply backpressure only if the 
destination is alive.
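
Something along these lines, with placeholder hooks standing in for the 
failure detector, the limiter and the send/hint logic (not the real 
StorageProxy API):

{code:java}
import java.net.InetAddress;
import java.util.function.Predicate;

// Minimal sketch of the suggested guard: only live destinations pay the
// back-pressure cost, dead ones are hinted without blocking on their limiter.
class HintedSendSketch
{
    void send(InetAddress destination,
              Predicate<InetAddress> isAlive, // stand-in for the failure detector
              Runnable applyBackPressure,     // stand-in for acquiring the replica's rate limiter
              Runnable sendOrHint)            // stand-in for sending the mutation or writing a hint
    {
        if (isAlive.test(destination))
            applyBackPressure.run(); // throttle only live replicas
        sendOrHint.run();            // a dead replica gets hinted without any back-pressure delay
    }
}
{code}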
 
{quote}
This leaves us with two options:

    Adding a new exception to the native protocol.
    Reusing a different exception, with WriteFailureException and 
UnavailableException the most likely candidates.

I'm currently leaning towards the latter option.
{quote}

Let's use UnavailableException, since WriteFailureException indicates a 
non-timeout failure when processing a mutation and so is not appropriate for 
this case. For protocol V4 we cannot change UnavailableException, but for V5 
we should add a new parameter to it. At the moment it contains 
{{<cl><required><alive>}}; we should add the number of overloaded replicas, so 
that drivers can treat the two cases differently. Another alternative, as 
suggested by [~slebresne], is to simply consider overloaded replicas as dead 
and hint them, thereby throwing unavailable exceptions as usual, but this is 
slightly less accurate than letting clients know that some replicas were 
unavailable and some simply overloaded.
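
For reference, the V5 body could conceptually carry something like the 
following (a sketch only, not the actual exception class or the final 
protocol spec):

{code:java}
// Illustrative sketch of the proposed V5 payload: the existing V4 body is
// <cl><required><alive>, and an extra <overloaded> count would let drivers
// tell dead replicas apart from back-pressured ones.
class UnavailableInfo
{
    enum ConsistencyLevel { ONE, QUORUM, LOCAL_QUORUM, ALL, ANY } // trimmed for the sketch

    final ConsistencyLevel consistency; // <cl>
    final int required;                 // <required>: replicas needed to satisfy the CL
    final int alive;                    // <alive>: replicas currently known to be up
    final int overloaded;               // proposed V5 addition: replicas excluded by back-pressure

    UnavailableInfo(ConsistencyLevel consistency, int required, int alive, int overloaded)
    {
        this.consistency = consistency;
        this.required = required;
        this.alive = alive;
        this.overloaded = overloaded;
    }

    @Override
    public String toString()
    {
        return String.format("Cannot achieve consistency level %s (required %d, alive %d, overloaded %d)",
                             consistency, required, alive, overloaded);
    }
}
{code}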
 
bq. We only need to ensure the coordinator for that specific mutation has 
back-pressure enabled, and we could do this by "marking" the MessageOut with a 
special parameter, what do you think?

Marking messages as throttled would let the replica know whether backpressure 
was enabled, that's true, but it also makes the existing mechanism even more 
complex. Also, as far as I understand it, dropping mutations that have been in 
the queue for longer than the RPC write timeout is done not only to shed load 
on the replica, but also to avoid wasting resources performing a mutation when 
the coordinator has already returned a timeout exception to the client. I think 
this still holds true regardless of backpressure. Since we cannot remove the 
timeout check in the write response handlers, I don't see how dropping the 
check replica-side helps. If the message was throttled, even with 
cross_node_timeout enabled, the replica should have time to process it before 
the RPC write timeout expires, so I don't think the extra complexity is 
justified.
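
In other words, the replica-side shedding I'm referring to is conceptually 
just this (names and the 2000 ms default write timeout are assumptions, not 
the actual code):

{code:java}
import java.util.concurrent.TimeUnit;

// Rough illustration of the shedding discussed above: a mutation that has
// already waited in the queue longer than the write RPC timeout is dropped,
// because the coordinator has already answered the client with a timeout and
// applying it would only waste resources on the replica.
class MutationTaskSketch
{
    static final long WRITE_RPC_TIMEOUT_MILLIS = 2000; // write_request_timeout_in_ms default

    private final long enqueuedAtNanos = System.nanoTime();

    boolean timedOutInQueue()
    {
        long waitedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAtNanos);
        return waitedMillis > WRITE_RPC_TIMEOUT_MILLIS;
    }

    void run(Runnable applyMutation)
    {
        if (timedOutInQueue())
            return; // shed load: the coordinator has already given up on this write
        applyMutation.run();
    }
}
{code}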

bq. If you all agree with that, I'll move forward and make that change.

To summarize, I agree with this change, provided the drivers can separate the 
two cases (node unavailable vs. node overloaded), which they will be able to do 
with V5 of the native protocol. The alternative would be to simply consider 
overloaded replicas as dead and hint them. Further, I still have concerns 
regarding additional write timeout exceptions and whether an overloaded or slow 
replica can slow everything down. [~slebresne], [~jbellis], anything else from 
your side? I think Jonathan's proposal of bounding the total outstanding 
requests to all replicas is somewhat different from what Sergio is trying to 
do, but it can be discussed further.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.


