[ https://issues.apache.org/jira/browse/CASSANDRA-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622810#comment-16622810 ]

Benedict commented on CASSANDRA-14768:
--------------------------------------

Having discussed this offline with [~aweisberg], we've come to the conclusion 
that the least surprising behaviour is to aim to deliver QUORUM in all DCs if 
we ACK a write, whatever the requested consistency level.  This can be done 
asynchronously, on timeout of the write callback for any full node in a 
given DC.  

So, instead of writing only to enough transient replicas to reach the user 
requested consistency level, we will write to transient replicas in every DC 
that has a node down.
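
A minimal sketch of that per-DC decision (hypothetical names and types, not Cassandra's actual write path): after the write callback times out for a full node, any DC containing an unresponsive full replica gets the mutation forwarded to its transient replicas, regardless of the requested consistency level.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: once we have ACKed the write (or timed out waiting on
// full nodes), decide which transient replicas should still receive the
// mutation so that every DC can reach a local QUORUM.
class TransientWriteSketch {
    record Replica(String dc, boolean full, boolean responded) {}

    // For each DC, if any full replica failed to respond within the write
    // callback timeout, target all transient replicas in that DC.
    static List<Replica> transientTargets(Map<String, List<Replica>> replicasByDc) {
        List<Replica> targets = new ArrayList<>();
        for (List<Replica> dcReplicas : replicasByDc.values()) {
            boolean fullNodeDown = dcReplicas.stream()
                    .anyMatch(r -> r.full() && !r.responded());
            if (fullNodeDown) {
                dcReplicas.stream()
                        .filter(r -> !r.full())
                        .forEach(targets::add);
            }
        }
        return targets;
    }
}
```

Note the decision is per-DC: a healthy DC (all full replicas responded) sends nothing extra to its transient replicas.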

Two questions remain to be answered: 

1) What should our behaviour be when we *don't* ACK to the client?  A good 
comparison might be a CL.ALL write that reached all but one node - without TR 
this still leaves a QUORUM in every DC, but with TR it may leave a single DC 
inconsistent.  It's not clear how this logically differs from the case above, 
since in both cases we have no API-level promise to fulfil, but it does seem 
that at least above we should try to deliver QUORUM, whereas here it is less 
clear.

2) If we do write to transient replicas in a DC after a full-node timeout, 
how many?  Should we aim for a QUORUM in the DC, or should we write to one 
extra transient replica for each full replica we failed to contact?
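
The two candidate policies can be sketched as simple arithmetic (hypothetical helper names; a QUORUM here is assumed to count all replicas in the DC, full and transient alike):

```java
// Hypothetical sketch of the two policies from question 2: how many
// transient replicas to contact in a DC where full replicas timed out.
class TransientCountSketch {
    // Option A: contact enough transient replicas to make up the shortfall
    // between the full-replica ACKs we got and a local QUORUM.
    static int forLocalQuorum(int fullInDc, int transientInDc, int fullResponded) {
        int rf = fullInDc + transientInDc;   // total replicas in the DC
        int quorum = rf / 2 + 1;             // local quorum size
        int needed = quorum - fullResponded; // shortfall after full ACKs
        return Math.max(0, Math.min(needed, transientInDc));
    }

    // Option B: one extra transient replica per full replica we failed
    // to contact.
    static int onePerFailedFull(int fullInDc, int transientInDc, int fullResponded) {
        int failed = fullInDc - fullResponded;
        return Math.min(failed, transientInDc);
    }
}
```

The policies diverge: with 3 full + 3 transient replicas and all 3 full nodes responding, option B sends nothing, while option A still needs one transient ACK to reach the quorum of 4.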

> Transient Replication: Consistency Level Semantics
> --------------------------------------------------
>
>                 Key: CASSANDRA-14768
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14768
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>            Reporter: Benedict
>            Priority: Major
>             Fix For: 4.0
>
>
> For a keyspace without transient replication, we will always attempt (and 
> write hints for) all logical endpoints, including those that seem to be alive 
> but are not responding (or perhaps dropping some messages).  With transient 
> replication, in this scenario we only write to the transient replicas if a 
> certain period of time elapses _and we have not met our consistency level_.
> This doesn’t lead to the same logical behaviour, although technically the 
> guarantees are the same.  In the past, you could expect that all DCs would 
> reach their own local quorum promptly, if say only a single node is failing.  
> Now, you could reach QUORUM with only one DC + 1 remote node, and the remote 
> DC will stay out of whack until repair runs.  This is even worse for e.g. 
> LOCAL_{QUORUM,ONE}.
> While the guarantees of the system are the same, the actual behaviour is 
> suboptimal - while the coordinator and remote DCs are healthy, in my opinion 
> we should do our best to ensure each DC reaches its own quorum, just as a 
> normal write would.
> This probably entails having our write callback handle failure to not only 
> write a hint for the endpoint, but also decide if a mutation should 
> immediately be sent to a corresponding transient replica.
> At the very least, we should discuss this before 4.0, even if we opt to take 
> no action before 4.x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
