[
https://issues.apache.org/jira/browse/CASSANDRA-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890403#comment-16890403
]
Tom van der Woerdt commented on CASSANDRA-15243:
------------------------------------------------
It's unclear to me what the fix should be here, but here are two thoughts:
* The code could be adapted to count an acknowledgement from a replica that
has a pending replica only when both of them respond. This way, blockFor could
stay at 4, and the code path could still distinguish 'both replied' from 'none
or one replied'.
* There could be a special-case exception in which pending replicas aren't
counted in totalBlockFor when both the pending replica and its source replica
are offline. However, it would be critical to ensure that they really are
offline: if either were to respond anyway, there would be a potential
consistency issue.
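The first proposal can be sketched in a few lines. This is a hypothetical illustration only: the class, method names, and parameters below are invented for the example and are not Cassandra's actual AbstractWriteResponseHandler API.

```java
// Hypothetical sketch of the first proposal: a (natural, pending) replica
// pair contributes one acknowledgement only when BOTH of them respond, so
// blockFor can stay at the plain quorum (4 for RF 6 in a 2:2:2 layout).
public class PairedAckSketch {
    // Acks from replicas that have no pending counterpart count as-is;
    // the paired endpoints add one ack only if both replied.
    static int effectiveAcks(int unpairedAcks, boolean naturalReplied, boolean pendingReplied) {
        return unpairedAcks + ((naturalReplied && pendingReplied) ? 1 : 0);
    }

    public static void main(String[] args) {
        int blockFor = 4; // plain QUORUM over RF 6; pending replica not added

        // both the natural replica and its pending replica replied: quorum met
        System.out.println(effectiveAcks(3, true, true) >= blockFor);   // true

        // only one, or neither, of the pair replied: quorum not met
        System.out.println(effectiveAcks(3, true, false) >= blockFor);  // false
        System.out.println(effectiveAcks(3, false, false) >= blockFor); // false
    }
}
```

Under this scheme 'none or one replied' never counts toward the quorum, which preserves the safety property the pending replica was added for.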
> removenode can cause QUORUM write queries to fail
> -------------------------------------------------
>
> Key: CASSANDRA-15243
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15243
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination
> Reporter: Tom van der Woerdt
> Priority: Normal
>
> Looks like nobody has found this yet, so this may be a ticking time bomb
> for some... :(
> This happened to me earlier today. On a Cassandra 3.11.4 cluster with three
> DCs, one DC had three servers fail due to unexpected external circumstances.
> Replication was NTS configured with 2:2:2.
> Cassandra dealt with the failures just fine - great! However, the nodes
> failed in a way that would make bringing them back impossible, so I tried to
> remove them using 'removenode'.
> Suddenly, the application started experiencing a large number of QUORUM write
> timeouts. My first reflex was to lower the streaming throughput and
> compaction throughput, since timeouts indicated some overload was happening.
> No luck, though.
> I tried a bunch of other things to reroute queries away from the affected
> datacenter, like changing the Severity field on the dynamic snitch. Still, no
> luck.
> After a while I noticed one strange thing: the WriteTimeoutException listed
> that five replicas were required, instead of the four you would expect in a
> 2:2:2 replication configuration. I shrugged it off as some weird
> inconsistency, probably caused by the use of batches.
> Skipping ahead a bit: since nothing I did was working, I decided to let the
> streams run again and just wait the issue out, hoping that letting the
> streams finish would resolve the overload. Magically, as soon as the streams
> finished, the errors stopped.
> ----
> There are two issues here, both in AbstractWriteResponseHandler.java.
> h3. Cassandra sometimes waits for too many replicas on writes
> In
> [totalBlockFor|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L124]
> Cassandra will *always* include pending nodes in `blockFor`. For a QUORUM
> query on a 2:2:2 replication configuration with two replicas in one DC down,
> this results in a blockFor of 5. If the pending replica is then also down
> (as can happen when removenode is used and not all destination hosts are
> up), only 4 of the 5 required hosts are available, and QUORUM queries can
> never succeed.
> h3. UnavailableException not thrown
> While debugging this, I spent all my time focusing on this issue as if it was
> a timeout. However, Cassandra was doing queries that could never succeed,
> because insufficient hosts were available. Throwing an UnavailableException
> would have been more helpful. The issue here is caused by
> [assureSufficientLiveNodes|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L155]
> which merely concatenates the lists of available nodes, and doesn't
> consider the special-case behavior of a pending node that is down.
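The arithmetic in the quoted report can be made concrete with a small sketch. The names below mirror the report's terminology but are illustrative only, not Cassandra's actual implementation.

```java
// Hypothetical sketch of the blockFor arithmetic described in the report.
public class BlockForSketch {
    // QUORUM over the full replication factor: floor(rf / 2) + 1
    static int quorumFor(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // totalBlockFor as reported: quorum plus ALL pending replicas,
    // whether or not those pending replicas are actually reachable.
    static int totalBlockFor(int replicationFactor, int pendingReplicas) {
        return quorumFor(replicationFactor) + pendingReplicas;
    }

    public static void main(String[] args) {
        int rf = 6;           // NTS 2:2:2
        int pending = 1;      // one pending replica created by removenode
        int liveReplicas = 4; // two natural replicas down, pending also down

        int blockFor = totalBlockFor(rf, pending);
        System.out.println(blockFor); // 5: the "five replicas required" in the report

        // Only 4 endpoints can ever answer, so the write can never reach
        // blockFor; it times out instead of failing fast with Unavailable.
        System.out.println(liveReplicas >= blockFor); // false
    }
}
```

This shows both symptoms at once: the surprising blockFor of 5, and a write that is doomed to time out rather than surface an UnavailableException.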
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)