[ 
https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078645#comment-17078645
 ] 

Kevin Gallardo commented on CASSANDRA-15642:
--------------------------------------------

I see the distinction now: "waiting until we can guarantee we can never 
succeed". It does make more sense presented this way.

However, when testing for CASSANDRA-15543, {{blockFor()}} and 
{{cassandraReplicaCount()}} had the same value, which means we would still fail 
as soon as n >= 1, bringing it back to the same conclusion.
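
To make the arithmetic concrete, here is a minimal Java sketch of the "fail as soon as we know we cannot succeed" rule. It is not Cassandra's actual code: the class {{FailFastCheck}} and the method {{cannotSucceed}} are hypothetical names, and the parameters only mirror the {{blockFor()}} value and replica count discussed above.

```java
// Hypothetical illustration only; not taken from the Cassandra code base.
final class FailFastCheck {
    /**
     * Returns true once success is impossible: even if every replica that has
     * not failed answers successfully, fewer than blockFor responses can arrive.
     */
    static boolean cannotSucceed(int blockFor, int replicaCount, int failures) {
        return replicaCount - failures < blockFor;
    }

    public static void main(String[] args) {
        // blockFor == replicaCount: the first failure already rules out success.
        System.out.println(cannotSucceed(3, 3, 1));
        // Five replicas, three required: two failures can still be tolerated.
        System.out.println(cannotSucceed(3, 5, 2));
        // A third failure leaves only two possible responses.
        System.out.println(cannotSucceed(3, 5, 3));
    }
}
```

Under this assumption, whenever the required count equals the replica count, the check degenerates to "fail on the first failure", which is the case I hit while testing.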

bq. if you want to file a ticket for it I think I can make us both happy: we 
should always fail a query as soon as we know it cannot succeed [...]

I think we still miss out on potential information that could be returned to 
the user and improve usability, but I have presented my arguments already, so I 
won't keep insisting. The case of the schema agreement error is still to me a 
clear situation where things could be improved. But if anything I would hope 
there would be a place where this sort of behavior is documented and explained, 
rather than users having to discover it by themselves in unfortunate 
circumstances, or having to go through the code.

Also, I agree that the "speculative read" (or, put more simply, a "retry", 
iiuc?) seems like a good idea. It would be an improvement, though the drivers 
already provide this sort of utility, and it is easily customizable by users 
compared to a server-side solution, so I suppose it has upsides and downsides. 
But something that makes completing a request more robust seems like a good 
idea regardless.

> Inconsistent failure messages on distributed queries
> ----------------------------------------------------
>
>                 Key: CASSANDRA-15642
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15642
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination
>            Reporter: Kevin Gallardo
>            Priority: Normal
>
> As a follow up to some exploration I have done for CASSANDRA-15543, I 
> noticed the following behavior in both {{ReadCallback}} and 
> {{AbstractWriteResponseHandler}}:
>  - wait for responses
>  - when the required number of responses has come back: unblock the wait
>  - when a single failure happens: unblock the wait
>  - when unblocked, look to see if the counter of failures is > 1 and if so 
> return an error message based on the {{failures}} map that's been filled
> Error messages that can result from this behavior can be a ReadTimeout, a 
> ReadFailure, a WriteTimeout or a WriteFailure.
> In case of a Write/ReadFailure, the user will get back an error looking like 
> the following:
> "Failure: Received X responses, and Y failures"
> (if this behavior I describe is incorrect, please correct me)
> This causes a usability problem. Since the handler will fail and throw an 
> exception as soon as 1 failure happens, the error message that is returned to 
> the user may not be accurate.
> (note: I am not entirely sure of the behavior in case of timeouts for now)
> For example, for a request at CL = QUORUM = 3, a failed response may arrive 
> first, then a successful one, and then another failure. If the exception 
> is thrown fast enough, the error message could say 
>  "Failure: Received 0 response, and 1 failure at CL = 3"
> Which:
> 1. doesn't make a lot of sense because the CL doesn't match the number of 
> results in the message, so you end up thinking "what happened with the rest 
> of the required CL?"
> 2. the information is incorrect. We did receive a successful response, only 
> it came after the initial failure.
> From that logic, I think it is safe to assume that the information returned 
> in the error message cannot be trusted in case of a failure. The only 
> information users should extract from it is that at least 1 node has failed.
> For a big improvement in usability, the {{ReadCallback}} and 
> {{AbstractWriteResponseHandler}} could instead wait for all responses to come 
> back before unblocking the wait, or let it time out. This way, users will be 
> able to trust the information returned to them.
> Additionally, an error that happens first prevents a timeout from happening 
> because the request fails immediately, so it potentially hides problems with 
> other replicas. If we were to wait for all responses, we might get a timeout, 
> and in that case we would also be able to tell whether failures happened 
> *before* that timeout. That would give a more complete diagnostic than 
> today, where you can't detect both errors at the same time.
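
The race described in the quoted issue can be replayed with a small, deterministic simulation. This is only a sketch under stated assumptions: {{FailureMessageSketch}}, {{Outcome}}, {{message}} and {{replay}} are hypothetical names that do not exist in Cassandra, and the message format is copied from the example in the description. The arrival order (failure, success, failure) is the one from the example.

```java
import java.util.List;

// Hypothetical simulation; the names below do not exist in Cassandra.
final class FailureMessageSketch {
    enum Outcome { SUCCESS, FAILURE }

    /** Builds the message from the counters as they stand when the wait unblocks. */
    static String message(int responses, int failures) {
        return "Failure: Received " + responses + " responses, and " + failures + " failures";
    }

    /**
     * Replays replica outcomes in arrival order. With failFast set, counting
     * stops at the first failure; otherwise every outcome is counted.
     * Assumes the list contains at least one failure.
     */
    static String replay(List<Outcome> arrivals, boolean failFast) {
        int responses = 0;
        int failures = 0;
        for (Outcome outcome : arrivals) {
            if (outcome == Outcome.SUCCESS) {
                responses++;
            } else {
                failures++;
            }
            if (failFast && failures > 0) {
                break; // unblock on the first failure
            }
        }
        return message(responses, failures);
    }

    public static void main(String[] args) {
        List<Outcome> arrivals = List.of(Outcome.FAILURE, Outcome.SUCCESS, Outcome.FAILURE);
        System.out.println(replay(arrivals, true));  // snapshot taken at the first failure
        System.out.println(replay(arrivals, false)); // counters after all replicas answered
    }
}
```

With the fail-fast flag the successful response is never counted, which is the inaccuracy the description points out. Waiting for all outcomes reports both the success and the second failure.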



